pith. sign in

arxiv: 1906.08928 · v1 · pith:ZAVCFJOTnew · submitted 2019-06-21 · 💻 cs.RO · cs.AI

Learning Reward Functions by Integrating Human Demonstrations and Preferences

Pith reviewed 2026-05-25 19:18 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords reward learninginverse reinforcement learningpreference-based learningactive learninghuman-robot interactiondemonstrations and preferences
0
0 comments X

The pith

DemPref learns robot reward functions by using demonstrations as a coarse prior to reduce and ground active preference queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that reward learning for robots becomes both accurate and efficient when demonstrations first supply a broad prior over possible reward functions and then anchor the generation of targeted preference questions. A reader would care because standard inverse reinforcement learning often fails with the imperfect demonstrations people actually give, while pure preference-based methods demand too many binary comparisons to work in high-dimensional spaces. The hybrid approach therefore shrinks the space of queries and improves their relevance without depending only on the demonstrations.

Core claim

DemPref first fits a coarse prior over the reward function space from the available demonstrations, thereby shrinking the region from which active preference queries are drawn, and then uses those same demonstrations to ground the query-generation process so that the resulting trajectory pairs are more informative. This combination removes the main efficiency bottleneck of preference-only learning while avoiding exclusive reliance on low-quality demonstrations.

What carries the argument

The DemPref framework that derives a coarse prior from demonstrations and uses it to constrain and ground active preference query selection.

If this is right

  • DemPref requires significantly fewer preference queries than standard active preference-based learning to reach comparable reward accuracy.
  • Robots trained with DemPref are rated by users as more successful at reproducing desired behavior than robots trained with standard IRL.
  • Users express a clear preference for using the DemPref interface over a standard IRL interface when teaching a robot.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-plus-grounding pattern could be tested in settings where the initial signal is even noisier, such as raw joystick inputs rather than full trajectories.
  • If the prior is learned from very few demonstrations, the method might still work when the preference queries are allowed to update the prior itself rather than treating it as fixed.
  • The efficiency gain could translate to lower cognitive load on the human teacher, measurable by total time or number of interactions needed to reach acceptable robot performance.

Load-bearing premise

Demonstrations supply a coarse prior that narrows the query space effectively even when the demonstrations themselves are of typical low quality.

What would settle it

An experiment in which DemPref requires at least as many preference queries as a pure active preference method to reach the same reward accuracy, or a user study in which participants do not rate DemPref-trained robots higher or prefer the system less than standard IRL.

Figures

Figures reproduced from arXiv: 1906.08928 by Dorsa Sadigh, Gleb Shevchuk, Malayandi Palan, Nicholas C. Landolfi.

Figure 1
Figure 1. Figure 1: We show how to leverage multi-modal human input – both demon [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The full DemPref framework. The human user provides demonstrations, which are used to learn a prior over reward functions. Then, we actively [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Views from each domain, with a demonstration in orange: (a.) Driver, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The results of our first experiment, investigating whether initializing with demonstrations improves convergence of the algorithm, on all three domains. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The results of our second experiment, investigating whether our rank [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: (Left) Our testing domain, with two trajectories generated according to the reward functions learnt by IRL and DemPref from a specific user in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Our goal is to accurately and efficiently learn reward functions for autonomous robots. Current approaches to this problem include inverse reinforcement learning (IRL), which uses expert demonstrations, and preference-based learning, which iteratively queries the user for her preferences between trajectories. In robotics however, IRL often struggles because it is difficult to get high-quality demonstrations; conversely, preference-based learning is very inefficient since it attempts to learn a continuous, high-dimensional function from binary feedback. We propose a new framework for reward learning, DemPref, that uses both demonstrations and preference queries to learn a reward function. Specifically, we (1) use the demonstrations to learn a coarse prior over the space of reward functions, to reduce the effective size of the space from which queries are generated; and (2) use the demonstrations to ground the (active) query generation process, to improve the quality of the generated queries. Our method alleviates the efficiency issues faced by standard preference-based learning methods and does not exclusively depend on (possibly low-quality) demonstrations. In numerical experiments, we find that DemPref is significantly more efficient than a standard active preference-based learning method. In a user study, we compare our method to a standard IRL method; we find that users rated the robot trained with DemPref as being more successful at learning their desired behavior, and preferred to use the DemPref system (over IRL) to train the robot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DemPref, a hybrid reward learning framework for robotics that integrates expert demonstrations with active preference queries. Demonstrations are used in two ways: (1) to induce a coarse prior over the reward function space that shrinks the effective query space, and (2) to ground the generation of preference queries. The method is evaluated in numerical experiments against a standard active preference baseline and in a user study against standard IRL, with claims of significantly higher efficiency and higher user preference ratings for the learned behavior.

Significance. If the efficiency and user-study results hold under scrutiny, the work offers a practical way to combine the strengths of IRL and preference-based methods while avoiding their primary weaknesses. The explicit design choice to treat demonstrations only as a coarse prior (rather than requiring high-quality trajectories) is a clear strength, and the numerical efficiency gains are directly tied to the proposed query-space reduction mechanism.

major comments (3)
  1. [§5.1] §5.1 (numerical experiments): the claim that DemPref is 'significantly more efficient' rests on query counts, but the text does not report variance across random seeds, statistical significance tests, or the exact number of independent runs; without these, it is impossible to assess whether the reported reduction in queries is robust or could be explained by variance in the active-learning baseline.
  2. [§6] §6 (user study): the preference ratings and success scores are presented as aggregate means, but the manuscript does not describe the exact questionnaire items, the scale used, or whether the order of exposure to DemPref vs. IRL was counterbalanced; these details are load-bearing for the claim that users 'preferred to use the DemPref system'.
  3. [§4.1] §4.1 (prior construction): the mapping from demonstrations to the coarse prior is described at a high level, but the precise parameterization of the prior (e.g., how the feature weights or variance are set from the demonstrations) is not given an explicit equation or algorithm box; this makes it difficult to verify that the prior truly reduces the query space without introducing bias when demonstrations are noisy.
minor comments (2)
  1. [Figure 3] Figure 3 caption and axis labels should explicitly state the number of trials or seeds used to generate the plotted curves.
  2. [§2] The related-work section cites several preference-learning papers but does not discuss recent hybrid IRL+preference methods that appeared after 2018; a brief comparison paragraph would strengthen the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§5.1] §5.1 (numerical experiments): the claim that DemPref is 'significantly more efficient' rests on query counts, but the text does not report variance across random seeds, statistical significance tests, or the exact number of independent runs; without these, it is impossible to assess whether the reported reduction in queries is robust or could be explained by variance in the active-learning baseline.

    Authors: We agree that variance, statistical tests, and run counts are necessary to substantiate the efficiency claims. The original experiments were run with 20 independent random seeds; we will add error bars, report standard deviations, and include paired t-test p-values comparing DemPref against the active preference baseline in the revised §5.1. revision: yes

  2. Referee: [§6] §6 (user study): the preference ratings and success scores are presented as aggregate means, but the manuscript does not describe the exact questionnaire items, the scale used, or whether the order of exposure to DemPref vs. IRL was counterbalanced; these details are load-bearing for the claim that users 'preferred to use the DemPref system'.

    Authors: We will expand §6 to list the exact questionnaire items (e.g., 'The robot learned my intended behavior'), specify the 7-point Likert scale, and state that exposure order was counterbalanced across participants via Latin square design. These details were collected but omitted for brevity; their inclusion will strengthen the user-study claims. revision: yes

  3. Referee: [§4.1] §4.1 (prior construction): the mapping from demonstrations to the coarse prior is described at a high level, but the precise parameterization of the prior (e.g., how the feature weights or variance are set from the demonstrations) is not given an explicit equation or algorithm box; this makes it difficult to verify that the prior truly reduces the query space without introducing bias when demonstrations are noisy.

    Authors: We will add an explicit equation and algorithm box in §4.1 showing that the prior mean is set to the feature weights recovered by maximum-likelihood IRL on the demonstrations and the prior covariance is a scaled identity matrix whose scale is chosen as a fixed hyperparameter (0.1 in our experiments). This parameterization is intentionally coarse to avoid over-reliance on potentially noisy demonstrations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via new hybrid method and experiments

full rationale

The paper introduces DemPref as a novel combination of IRL-style demonstrations (for coarse prior and query grounding) with active preference queries. The central claims rest on explicit numerical efficiency comparisons against standard active preference learning and a user study against standard IRL; neither reduces to a fitted parameter renamed as prediction nor to any self-citation chain. No equations or uniqueness theorems are invoked that collapse back to the inputs by construction. The method is explicitly described as not relying exclusively on demonstrations, preserving independent content in the empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method builds on standard IRL and preference learning assumptions, but details are not provided.

pith-pipeline@v0.9.0 · 5788 in / 1201 out tokens · 68395 ms · 2026-05-25T19:18:49.165028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

    cs.LG 2026-02 unverdicted novelty 6.0

    OPRIDE improves query efficiency in offline PbRL via a principled in-dataset exploration strategy and discount scheduling, outperforming prior methods with fewer queries and providing theoretical guarantees.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Apprenticeship learn- ing via inverse reinforcement learning

    Pieter Abbeel and Andrew Y Ng. Apprenticeship learn- ing via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004

  2. [2]

    An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity

    Nir Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research , 13 (Jan):137–164, 2012

  3. [3]

    April: Active preference learning-based reinforcement learning

    Riad Akrour, Marc Schoenauer, and Mich `ele Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 116–131. Springer, 2012

  4. [4]

    A survey of robot learning from demonstration

    Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems , 57 (5):469–483, 2009

  5. [5]

    Learning robot objectives from physical human interaction

    Andrea Bajcsy, Dylan P Losey, Marcia K OMalley, and Anca D Dragan. Learning robot objectives from physical human interaction. In Conference on Robot Learning , pages 217–226, 2017

  6. [6]

    Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425

    Chandrayee Basu, Qian Yang, David Hungerman, Mukesh Singhal, and Anca D Dragan. Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425. ACM, 2017

  7. [7]

    Batch Active Preference-Based Learning of Reward Functions

    Erdem Bıyık and Dorsa Sadigh. Batch active preference- based learning of reward functions. arXiv preprint arXiv:1810.04303, 2018

  8. [8]

    Distributed optimization and statistical learning via the alternating direction method of multipliers

    Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends R⃝ in Machine learning, 3(1):1–122, 2011

  9. [9]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

  10. [10]

    Risk-Aware Active Inverse Reinforcement Learning

    Daniel S Brown, Yuchen Cui, and Scott Niekum. Risk- aware active inverse reinforcement learning. arXiv preprint arXiv:1901.02161, 2019

  11. [11]

    Make the table/big block in fetch environ- ments fixed., 2018

    Joy Chopra. Make the table/big block in fetch environ- ments fixed., 2018. URL https://github.com/openai/gym/ issues/920

  12. [12]

    Active reward learning from critiques

    Yuchen Cui and Scott Niekum. Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 6907–6914. IEEE, 2018

  13. [13]

    Openai baselines

    Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017

  14. [14]

    Generating legible motion

    Anca Dragan and Siddhartha Srinivasa. Generating legible motion. 2013

  15. [15]

    Formalizing assistive teleoperation

    Anca D Dragan and Siddhartha S Srinivasa. Formalizing assistive teleoperation. MIT Press, July, 2012

  16. [16]

    One-shot imitation learning

    Yan Duan, Marcin Andrychowicz, Bradly Stadie, Ope- nAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017

  17. [17]

    Active preference learning with discrete choice data

    Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in neural information processing systems , pages 409–416, 2008

  18. [18]

    Guided cost learning: Deep inverse optimal control via policy optimization

    Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016

  19. [19]

    Preference-based reinforcement learning: a formal framework and a policy iteration algorithm

    Johannes F ¨urnkranz, Eyke H ¨ullermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012

  20. [20]

    Exploring voting blocs within the irish electorate: A mixture modeling approach

    Isobel Claire Gormley and Thomas Brendan Murphy. Exploring voting blocs within the irish electorate: A mixture modeling approach. Journal of the American Statistical Association, 103(483):1014–1027, 2008

  21. [21]

    Bayesian inference for plackett-luce ranking models

    John Guiver and Edward Snelson. Bayesian inference for plackett-luce ranking models. In proceedings of the 26th annual international conference on machine learning , pages 377–384. ACM, 2009

  22. [22]

    Inverse reward design

    Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stu- art J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems , pages 6765–6774, 2017

  23. [23]

    Generative adversarial imitation learning

    Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016

  24. [24]

    Active comparison based learning incorporating user uncertainty and noise

    Rachel Holladay, Shervin Javdani, Anca Dragan, and Siddhartha Srinivasa. Active comparison based learning incorporating user uncertainty and noise. In RSS Work- shop on Model Learning for Human-Robot Communica- tion, 2016

  25. [25]

    Reward learning from human preferences and demonstrations in atari

    Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. In Ad- vances in Neural Information Processing Systems , pages 8022–8034, 2018

  26. [26]

    Learning preferences for manipulation tasks from online coactive feedback

    Ashesh Jain, Shikhar Sharma, Thorsten Joachims, and Ashutosh Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research , 34(10):1296–1313, 2015

  27. [27]

    Shared Autonomy via Hindsight Optimization

    Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. arXiv preprint arXiv:1503.07619 , 2015

  28. [28]

    URL https://twitter.com/mat kelcey/ status/886101319559335936

    Mat Kelcey, 2017. URL https://twitter.com/mat kelcey/ status/886101319559335936

  29. [29]

    Data-driven motion mappings improve transparency in teleoperation

    Rebecca P Khurshid and Katherine J Kuchenbecker. Data-driven motion mappings improve transparency in teleoperation. Presence: Teleoperators and Virtual Envi- ronments, 24(2):132–154, 2015

  30. [30]

    Continuous Inverse Optimal Control with Locally Optimal Examples

    Sergey Levine and Vladlen Koltun. Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617, 2012

  31. [31]

    Individual choice behavior: A theoret- ical analysis

    R Duncan Luce. Individual choice behavior: A theoret- ical analysis. Courier Corporation, 2012

  32. [32]

    An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

    Dhruv Malik, Malayandi Palaniappan, Jaime F Fisac, Dylan Hadfield-Menell, Stuart Russell, and Anca D Dra- gan. An efficient, generalized bellman update for co- operative inverse reinforcement learning. arXiv preprint arXiv:1806.03820, 2018

  33. [33]

    Conditional logit analysis of qualitative choice behavior

    Daniel McFadden et al. Conditional logit analysis of qualitative choice behavior. 1973

  34. [34]

    The analysis of permutations

    Robin L Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975

  35. [35]

    Bayesian inverse reinforcement learning

    Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007

  36. [36]

    Maximum margin planning

    Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning , pages 729–736. ACM, 2006

  37. [37]

    Simplifying Reward Design through Divide-and-Conquer

    Ellis Ratner, Dylan Hadfield-Menell, and Anca D Dra- gan. Simplifying reward design through divide-and- conquer. arXiv preprint arXiv:1806.02501 , 2018

  38. [38]

    A reduction of imitation learning and structured prediction to no-regret online learning

    St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the four- teenth international conference on artificial intelligence and statistics, pages 627–635, 2011

  39. [39]

    Active preference-based learning of reward functions

    Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions

  40. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017

  41. [41]

    Preference-learning based inverse reinforcement learning for dialog control

    Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Mi- nami. Preference-learning based inverse reinforcement learning for dialog control. In Thirteenth Annual Con- ference of the International Speech Communication As- sociation, 2012

  42. [42]

    Integrating reinforcement learning with human demonstrations of varying ability

    Matthew E Taylor, Halit Bener Suay, and Sonia Cher- nova. Integrating reinforcement learning with human demonstrations of varying ability. In The 10th Interna- tional Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 617–624. International Foun- dation for Autonomous Agents and Multiagent Systems, 2011

  43. [43]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ Interna- tional Conference on , pages 5026–5033. IEEE, 2012

  44. [44]

    A bayesian approach for policy learning from trajectory preference queries

    Aaron Wilson, Alan Fern, and Prasad Tadepalli. A bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pages 1133–1141, 2012

  45. [45]

    Fetch and freight: Standard platforms for service robot applications

    Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots , 2016

  46. [46]

    Learning Reward Functions by Integrating Human Demonstrations and Preferences

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. APPENDIX This appendix complements the RSS 2019 paper, “Learning Reward Functions by Integrating Human Demonstrations and Preferences”. A. Supplemental Videos We have provided the fo...

  47. [47]

    Demo.mov: This video shows a user teleoperating the robot using the keyboard interface

  48. [48]

    Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal

    PrefT1.mov and PrefT2.mov: These two videos show a preference query (two trajectories) generated by our system. Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal. Additionally, note the jaggedness of the trajectory: this is due to the highly non-convex nature...

  49. [49]

    RolloutDemPref.mov: This video shows a sample trajectory generated by PPO, according to the reward function learned by DemPref (from a specific user). (In reality, the robot arm does get fairly close to the goal; we intentionally kept the table much lower when rolling-out behavior on the real robot to prevent collisions between the robot and the table. Use...

  50. [50]

    main”, “update func

    RolloutIRL.mov: This video shows a sample trajec- tory generated by PPO, according to the reward function learned by IRL (from the same user as above). Note the extremely poor performance of the robot – this is discussed in Section VII. B. Code The repository for this project is provided at the following link https://github.com/malayandi/DemPrefCode. Depe...

  51. [51]

    As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments

    Number of samples used in Monte Carlo approximation to objective in (6): 50,000 True Reward Function. As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments. We chose a weight vector that seemed reasonable in each domain. No tuning was performed. The weight vector for each domain is as follows:

  52. [52]

    Driver: [0.5, -0.2, 0.2, -0.7]

  53. [53]

    Lunar Lander: [-0.4, 0.4, -0.2, -0.7]

  54. [54]

    Fetch Reach: [-0.6, -0.3, 0.9] Any further experimental details not found here can be found in the provided code