Learning Reward Functions by Integrating Human Demonstrations and Preferences

Dorsa Sadigh; Gleb Shevchuk; Malayandi Palan; Nicholas C. Landolfi

arxiv: 1906.08928 · v1 · pith:ZAVCFJOTnew · submitted 2019-06-21 · 💻 cs.RO · cs.AI

Learning Reward Functions by Integrating Human Demonstrations and Preferences

Malayandi Palan , Nicholas C. Landolfi , Gleb Shevchuk , Dorsa Sadigh This is my paper

Pith reviewed 2026-05-25 19:18 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords reward learninginverse reinforcement learningpreference-based learningactive learninghuman-robot interactiondemonstrations and preferences

0 comments

The pith

DemPref learns robot reward functions by using demonstrations as a coarse prior to reduce and ground active preference queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that reward learning for robots becomes both accurate and efficient when demonstrations first supply a broad prior over possible reward functions and then anchor the generation of targeted preference questions. A reader would care because standard inverse reinforcement learning often fails with the imperfect demonstrations people actually give, while pure preference-based methods demand too many binary comparisons to work in high-dimensional spaces. The hybrid approach therefore shrinks the space of queries and improves their relevance without depending only on the demonstrations.

Core claim

DemPref first fits a coarse prior over the reward function space from the available demonstrations, thereby shrinking the region from which active preference queries are drawn, and then uses those same demonstrations to ground the query-generation process so that the resulting trajectory pairs are more informative. This combination removes the main efficiency bottleneck of preference-only learning while avoiding exclusive reliance on low-quality demonstrations.

What carries the argument

The DemPref framework that derives a coarse prior from demonstrations and uses it to constrain and ground active preference query selection.

If this is right

DemPref requires significantly fewer preference queries than standard active preference-based learning to reach comparable reward accuracy.
Robots trained with DemPref are rated by users as more successful at reproducing desired behavior than robots trained with standard IRL.
Users express a clear preference for using the DemPref interface over a standard IRL interface when teaching a robot.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-plus-grounding pattern could be tested in settings where the initial signal is even noisier, such as raw joystick inputs rather than full trajectories.
If the prior is learned from very few demonstrations, the method might still work when the preference queries are allowed to update the prior itself rather than treating it as fixed.
The efficiency gain could translate to lower cognitive load on the human teacher, measurable by total time or number of interactions needed to reach acceptable robot performance.

Load-bearing premise

Demonstrations supply a coarse prior that narrows the query space effectively even when the demonstrations themselves are of typical low quality.

What would settle it

An experiment in which DemPref requires at least as many preference queries as a pure active preference method to reach the same reward accuracy, or a user study in which participants do not rate DemPref-trained robots higher or prefer the system less than standard IRL.

Figures

Figures reproduced from arXiv: 1906.08928 by Dorsa Sadigh, Gleb Shevchuk, Malayandi Palan, Nicholas C. Landolfi.

**Figure 2.** Figure 2: The full DemPref framework. The human user provides demonstrations, which are used to learn a prior over reward functions. Then, we actively [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Views from each domain, with a demonstration in orange: (a.) Driver, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: The results of our first experiment, investigating whether initializing with demonstrations improves convergence of the algorithm, on all three domains. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The results of our second experiment, investigating whether our rank [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: (Left) Our testing domain, with two trajectories generated according to the reward functions learnt by IRL and DemPref from a specific user in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Our goal is to accurately and efficiently learn reward functions for autonomous robots. Current approaches to this problem include inverse reinforcement learning (IRL), which uses expert demonstrations, and preference-based learning, which iteratively queries the user for her preferences between trajectories. In robotics however, IRL often struggles because it is difficult to get high-quality demonstrations; conversely, preference-based learning is very inefficient since it attempts to learn a continuous, high-dimensional function from binary feedback. We propose a new framework for reward learning, DemPref, that uses both demonstrations and preference queries to learn a reward function. Specifically, we (1) use the demonstrations to learn a coarse prior over the space of reward functions, to reduce the effective size of the space from which queries are generated; and (2) use the demonstrations to ground the (active) query generation process, to improve the quality of the generated queries. Our method alleviates the efficiency issues faced by standard preference-based learning methods and does not exclusively depend on (possibly low-quality) demonstrations. In numerical experiments, we find that DemPref is significantly more efficient than a standard active preference-based learning method. In a user study, we compare our method to a standard IRL method; we find that users rated the robot trained with DemPref as being more successful at learning their desired behavior, and preferred to use the DemPref system (over IRL) to train the robot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DemPref is a practical hybrid that uses demos to shrink and ground preference queries, with efficiency claims that need the actual numbers to evaluate.

read the letter

The paper's main contribution is a framework called DemPref that feeds demonstrations into preference-based reward learning in two concrete ways: first to build a coarse prior over rewards that shrinks the space for generating queries, and second to anchor the active query selection so the questions are more useful. This directly targets the usual problems with each method alone—IRL falling apart on noisy real-world demos, and preference learning burning through too many binary queries on a high-dimensional reward function. The numerical experiments report that it requires fewer queries than standard active preference learning, and the user study finds participants rated the DemPref-trained robot higher and preferred the system over plain IRL. That combination is the actual new piece; prior work has mixed the two signals but not with this explicit prior-plus-grounding construction. The approach is honest about not needing perfect demos, which matches robotics reality. The experiments and user study give it some empirical footing beyond pure theory. The soft spots are mostly about missing detail rather than outright flaws. The abstract states efficiency gains and user preference without reporting the magnitude, error bars, or exact statistical tests, so it is hard to judge whether the improvement is large enough to matter in practice. The central assumption—that the coarse prior from demonstrations reliably reduces the query space without being derailed by typical demo noise—could be stress-tested more explicitly, even though the method is built to fall back on preferences. The user study is a plus but would benefit from clearer protocol on how trajectories were shown and how success was measured. Overall this is aimed at people working on reward learning for robots or human-robot interaction who already know the limitations of IRL and preference methods. A reader in that area would get a usable recipe and some evidence it helps. The work is coherent enough on its own terms to deserve a serious referee who can check the implementation and run the numbers.

Referee Report

3 major / 2 minor

Summary. The paper proposes DemPref, a hybrid reward learning framework for robotics that integrates expert demonstrations with active preference queries. Demonstrations are used in two ways: (1) to induce a coarse prior over the reward function space that shrinks the effective query space, and (2) to ground the generation of preference queries. The method is evaluated in numerical experiments against a standard active preference baseline and in a user study against standard IRL, with claims of significantly higher efficiency and higher user preference ratings for the learned behavior.

Significance. If the efficiency and user-study results hold under scrutiny, the work offers a practical way to combine the strengths of IRL and preference-based methods while avoiding their primary weaknesses. The explicit design choice to treat demonstrations only as a coarse prior (rather than requiring high-quality trajectories) is a clear strength, and the numerical efficiency gains are directly tied to the proposed query-space reduction mechanism.

major comments (3)

[§5.1] §5.1 (numerical experiments): the claim that DemPref is 'significantly more efficient' rests on query counts, but the text does not report variance across random seeds, statistical significance tests, or the exact number of independent runs; without these, it is impossible to assess whether the reported reduction in queries is robust or could be explained by variance in the active-learning baseline.
[§6] §6 (user study): the preference ratings and success scores are presented as aggregate means, but the manuscript does not describe the exact questionnaire items, the scale used, or whether the order of exposure to DemPref vs. IRL was counterbalanced; these details are load-bearing for the claim that users 'preferred to use the DemPref system'.
[§4.1] §4.1 (prior construction): the mapping from demonstrations to the coarse prior is described at a high level, but the precise parameterization of the prior (e.g., how the feature weights or variance are set from the demonstrations) is not given an explicit equation or algorithm box; this makes it difficult to verify that the prior truly reduces the query space without introducing bias when demonstrations are noisy.

minor comments (2)

[Figure 3] Figure 3 caption and axis labels should explicitly state the number of trials or seeds used to generate the plotted curves.
[§2] The related-work section cites several preference-learning papers but does not discuss recent hybrid IRL+preference methods that appeared after 2018; a brief comparison paragraph would strengthen the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [§5.1] §5.1 (numerical experiments): the claim that DemPref is 'significantly more efficient' rests on query counts, but the text does not report variance across random seeds, statistical significance tests, or the exact number of independent runs; without these, it is impossible to assess whether the reported reduction in queries is robust or could be explained by variance in the active-learning baseline.

Authors: We agree that variance, statistical tests, and run counts are necessary to substantiate the efficiency claims. The original experiments were run with 20 independent random seeds; we will add error bars, report standard deviations, and include paired t-test p-values comparing DemPref against the active preference baseline in the revised §5.1. revision: yes
Referee: [§6] §6 (user study): the preference ratings and success scores are presented as aggregate means, but the manuscript does not describe the exact questionnaire items, the scale used, or whether the order of exposure to DemPref vs. IRL was counterbalanced; these details are load-bearing for the claim that users 'preferred to use the DemPref system'.

Authors: We will expand §6 to list the exact questionnaire items (e.g., 'The robot learned my intended behavior'), specify the 7-point Likert scale, and state that exposure order was counterbalanced across participants via Latin square design. These details were collected but omitted for brevity; their inclusion will strengthen the user-study claims. revision: yes
Referee: [§4.1] §4.1 (prior construction): the mapping from demonstrations to the coarse prior is described at a high level, but the precise parameterization of the prior (e.g., how the feature weights or variance are set from the demonstrations) is not given an explicit equation or algorithm box; this makes it difficult to verify that the prior truly reduces the query space without introducing bias when demonstrations are noisy.

Authors: We will add an explicit equation and algorithm box in §4.1 showing that the prior mean is set to the feature weights recovered by maximum-likelihood IRL on the demonstrations and the prior covariance is a scaled identity matrix whose scale is chosen as a fixed hyperparameter (0.1 in our experiments). This parameterization is intentionally coarse to avoid over-reliance on potentially noisy demonstrations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via new hybrid method and experiments

full rationale

The paper introduces DemPref as a novel combination of IRL-style demonstrations (for coarse prior and query grounding) with active preference queries. The central claims rest on explicit numerical efficiency comparisons against standard active preference learning and a user study against standard IRL; neither reduces to a fitted parameter renamed as prediction nor to any self-citation chain. No equations or uniqueness theorems are invoked that collapse back to the inputs by construction. The method is explicitly described as not relying exclusively on demonstrations, preserving independent content in the empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method builds on standard IRL and preference learning assumptions, but details are not provided.

pith-pipeline@v0.9.0 · 5788 in / 1201 out tokens · 68395 ms · 2026-05-25T19:18:49.165028+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
cs.LG 2026-02 unverdicted novelty 6.0

OPRIDE improves query efficiency in offline PbRL via a principled in-dataset exploration strategy and discount scheduling, outperforming prior methods with fewer queries and providing theoretical guarantees.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Apprenticeship learn- ing via inverse reinforcement learning

Pieter Abbeel and Andrew Y Ng. Apprenticeship learn- ing via inverse reinforcement learning. In Proceedings of the twenty-ﬁrst international conference on Machine learning, page 1. ACM, 2004

work page 2004
[2]

An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity

Nir Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research , 13 (Jan):137–164, 2012

work page 2012
[3]

April: Active preference learning-based reinforcement learning

Riad Akrour, Marc Schoenauer, and Mich `ele Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 116–131. Springer, 2012

work page 2012
[4]

A survey of robot learning from demonstration

Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems , 57 (5):469–483, 2009

work page 2009
[5]

Learning robot objectives from physical human interaction

Andrea Bajcsy, Dylan P Losey, Marcia K OMalley, and Anca D Dragan. Learning robot objectives from physical human interaction. In Conference on Robot Learning , pages 217–226, 2017

work page 2017
[6]

Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425

Chandrayee Basu, Qian Yang, David Hungerman, Mukesh Singhal, and Anca D Dragan. Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425. ACM, 2017

work page 2017
[7]

Batch Active Preference-Based Learning of Reward Functions

Erdem Bıyık and Dorsa Sadigh. Batch active preference- based learning of reward functions. arXiv preprint arXiv:1810.04303, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Distributed optimization and statistical learning via the alternating direction method of multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends R⃝ in Machine learning, 3(1):1–122, 2011

work page 2011
[9]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Risk-Aware Active Inverse Reinforcement Learning

Daniel S Brown, Yuchen Cui, and Scott Niekum. Risk- aware active inverse reinforcement learning. arXiv preprint arXiv:1901.02161, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[11]

Make the table/big block in fetch environ- ments ﬁxed., 2018

Joy Chopra. Make the table/big block in fetch environ- ments ﬁxed., 2018. URL https://github.com/openai/gym/ issues/920

work page 2018
[12]

Active reward learning from critiques

Yuchen Cui and Scott Niekum. Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 6907–6914. IEEE, 2018

work page 2018
[13]

Openai baselines

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017

work page 2017
[14]

Generating legible motion

Anca Dragan and Siddhartha Srinivasa. Generating legible motion. 2013

work page 2013
[15]

Formalizing assistive teleoperation

Anca D Dragan and Siddhartha S Srinivasa. Formalizing assistive teleoperation. MIT Press, July, 2012

work page 2012
[16]

One-shot imitation learning

Yan Duan, Marcin Andrychowicz, Bradly Stadie, Ope- nAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017

work page 2017
[17]

Active preference learning with discrete choice data

Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in neural information processing systems , pages 409–416, 2008

work page 2008
[18]

Guided cost learning: Deep inverse optimal control via policy optimization

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016

work page 2016
[19]

Preference-based reinforcement learning: a formal framework and a policy iteration algorithm

Johannes F ¨urnkranz, Eyke H ¨ullermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012

work page 2012
[20]

Exploring voting blocs within the irish electorate: A mixture modeling approach

Isobel Claire Gormley and Thomas Brendan Murphy. Exploring voting blocs within the irish electorate: A mixture modeling approach. Journal of the American Statistical Association, 103(483):1014–1027, 2008

work page 2008
[21]

Bayesian inference for plackett-luce ranking models

John Guiver and Edward Snelson. Bayesian inference for plackett-luce ranking models. In proceedings of the 26th annual international conference on machine learning , pages 377–384. ACM, 2009

work page 2009
[22]

Inverse reward design

Dylan Hadﬁeld-Menell, Smitha Milli, Pieter Abbeel, Stu- art J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems , pages 6765–6774, 2017

work page 2017
[23]

Generative adversarial imitation learning

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016

work page 2016
[24]

Active comparison based learning incorporating user uncertainty and noise

Rachel Holladay, Shervin Javdani, Anca Dragan, and Siddhartha Srinivasa. Active comparison based learning incorporating user uncertainty and noise. In RSS Work- shop on Model Learning for Human-Robot Communica- tion, 2016

work page 2016
[25]

Reward learning from human preferences and demonstrations in atari

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. In Ad- vances in Neural Information Processing Systems , pages 8022–8034, 2018

work page 2018
[26]

Learning preferences for manipulation tasks from online coactive feedback

Ashesh Jain, Shikhar Sharma, Thorsten Joachims, and Ashutosh Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research , 34(10):1296–1313, 2015

work page 2015
[27]

Shared Autonomy via Hindsight Optimization

Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. arXiv preprint arXiv:1503.07619 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[28]

URL https://twitter.com/mat kelcey/ status/886101319559335936

Mat Kelcey, 2017. URL https://twitter.com/mat kelcey/ status/886101319559335936

work page arXiv 2017
[29]

Data-driven motion mappings improve transparency in teleoperation

Rebecca P Khurshid and Katherine J Kuchenbecker. Data-driven motion mappings improve transparency in teleoperation. Presence: Teleoperators and Virtual Envi- ronments, 24(2):132–154, 2015

work page 2015
[30]

Continuous Inverse Optimal Control with Locally Optimal Examples

Sergey Levine and Vladlen Koltun. Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[31]

Individual choice behavior: A theoret- ical analysis

R Duncan Luce. Individual choice behavior: A theoret- ical analysis. Courier Corporation, 2012

work page 2012
[32]

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

Dhruv Malik, Malayandi Palaniappan, Jaime F Fisac, Dylan Hadﬁeld-Menell, Stuart Russell, and Anca D Dra- gan. An efﬁcient, generalized bellman update for co- operative inverse reinforcement learning. arXiv preprint arXiv:1806.03820, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Conditional logit analysis of qualitative choice behavior

Daniel McFadden et al. Conditional logit analysis of qualitative choice behavior. 1973

work page 1973
[34]

The analysis of permutations

Robin L Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975

work page 1975
[35]

Bayesian inverse reinforcement learning

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007

work page 2007
[36]

Maximum margin planning

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning , pages 729–736. ACM, 2006

work page 2006
[37]

Simplifying Reward Design through Divide-and-Conquer

Ellis Ratner, Dylan Hadﬁeld-Menell, and Anca D Dra- gan. Simplifying reward design through divide-and- conquer. arXiv preprint arXiv:1806.02501 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[38]

A reduction of imitation learning and structured prediction to no-regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the four- teenth international conference on artiﬁcial intelligence and statistics, pages 627–635, 2011

work page 2011
[39]

Active preference-based learning of reward functions

Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions

work page
[40]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Preference-learning based inverse reinforcement learning for dialog control

Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Mi- nami. Preference-learning based inverse reinforcement learning for dialog control. In Thirteenth Annual Con- ference of the International Speech Communication As- sociation, 2012

work page 2012
[42]

Integrating reinforcement learning with human demonstrations of varying ability

Matthew E Taylor, Halit Bener Suay, and Sonia Cher- nova. Integrating reinforcement learning with human demonstrations of varying ability. In The 10th Interna- tional Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 617–624. International Foun- dation for Autonomous Agents and Multiagent Systems, 2011

work page 2011
[43]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ Interna- tional Conference on , pages 5026–5033. IEEE, 2012

work page 2012
[44]

A bayesian approach for policy learning from trajectory preference queries

Aaron Wilson, Alan Fern, and Prasad Tadepalli. A bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pages 1133–1141, 2012

work page 2012
[45]

Fetch and freight: Standard platforms for service robot applications

Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots , 2016

work page 2016
[46]

Learning Reward Functions by Integrating Human Demonstrations and Preferences

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. APPENDIX This appendix complements the RSS 2019 paper, “Learning Reward Functions by Integrating Human Demonstrations and Preferences”. A. Supplemental Videos We have provided the fo...

work page 2008
[47]

Demo.mov: This video shows a user teleoperating the robot using the keyboard interface

work page
[48]

Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal

PrefT1.mov and PrefT2.mov: These two videos show a preference query (two trajectories) generated by our system. Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal. Additionally, note the jaggedness of the trajectory: this is due to the highly non-convex nature...

work page
[49]

RolloutDemPref.mov: This video shows a sample trajectory generated by PPO, according to the reward function learned by DemPref (from a speciﬁc user). (In reality, the robot arm does get fairly close to the goal; we intentionally kept the table much lower when rolling-out behavior on the real robot to prevent collisions between the robot and the table. Use...

work page
[50]

main”, “update func

RolloutIRL.mov: This video shows a sample trajec- tory generated by PPO, according to the reward function learned by IRL (from the same user as above). Note the extremely poor performance of the robot – this is discussed in Section VII. B. Code The repository for this project is provided at the following link https://github.com/malayandi/DemPrefCode. Depe...

work page
[51]

As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments

Number of samples used in Monte Carlo approximation to objective in (6): 50,000 True Reward Function. As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments. We chose a weight vector that seemed reasonable in each domain. No tuning was performed. The weight vector for each domain is as follows:

work page
[52]

Driver: [0.5, -0.2, 0.2, -0.7]

work page
[53]

Lunar Lander: [-0.4, 0.4, -0.2, -0.7]

work page
[54]

Fetch Reach: [-0.6, -0.3, 0.9] Any further experimental details not found here can be found in the provided code

work page

[1] [1]

Apprenticeship learn- ing via inverse reinforcement learning

Pieter Abbeel and Andrew Y Ng. Apprenticeship learn- ing via inverse reinforcement learning. In Proceedings of the twenty-ﬁrst international conference on Machine learning, page 1. ACM, 2004

work page 2004

[2] [2]

An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity

Nir Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research , 13 (Jan):137–164, 2012

work page 2012

[3] [3]

April: Active preference learning-based reinforcement learning

Riad Akrour, Marc Schoenauer, and Mich `ele Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 116–131. Springer, 2012

work page 2012

[4] [4]

A survey of robot learning from demonstration

Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems , 57 (5):469–483, 2009

work page 2009

[5] [5]

Learning robot objectives from physical human interaction

Andrea Bajcsy, Dylan P Losey, Marcia K OMalley, and Anca D Dragan. Learning robot objectives from physical human interaction. In Conference on Robot Learning , pages 217–226, 2017

work page 2017

[6] [6]

Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425

Chandrayee Basu, Qian Yang, David Hungerman, Mukesh Singhal, and Anca D Dragan. Do you want your autonomous car to drive like you? In Proceedings of the 2017 ACM/IEEE International Conference on Human- Robot Interaction, pages 417–425. ACM, 2017

work page 2017

[7] [7]

Batch Active Preference-Based Learning of Reward Functions

Erdem Bıyık and Dorsa Sadigh. Batch active preference- based learning of reward functions. arXiv preprint arXiv:1810.04303, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Distributed optimization and statistical learning via the alternating direction method of multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends R⃝ in Machine learning, 3(1):1–122, 2011

work page 2011

[9] [9]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

Risk-Aware Active Inverse Reinforcement Learning

Daniel S Brown, Yuchen Cui, and Scott Niekum. Risk- aware active inverse reinforcement learning. arXiv preprint arXiv:1901.02161, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[11] [11]

Make the table/big block in fetch environ- ments ﬁxed., 2018

Joy Chopra. Make the table/big block in fetch environ- ments ﬁxed., 2018. URL https://github.com/openai/gym/ issues/920

work page 2018

[12] [12]

Active reward learning from critiques

Yuchen Cui and Scott Niekum. Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 6907–6914. IEEE, 2018

work page 2018

[13] [13]

Openai baselines

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017

work page 2017

[14] [14]

Generating legible motion

Anca Dragan and Siddhartha Srinivasa. Generating legible motion. 2013

work page 2013

[15] [15]

Formalizing assistive teleoperation

Anca D Dragan and Siddhartha S Srinivasa. Formalizing assistive teleoperation. MIT Press, July, 2012

work page 2012

[16] [16]

One-shot imitation learning

Yan Duan, Marcin Andrychowicz, Bradly Stadie, Ope- nAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017

work page 2017

[17] [17]

Active preference learning with discrete choice data

Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in neural information processing systems , pages 409–416, 2008

work page 2008

[18] [18]

Guided cost learning: Deep inverse optimal control via policy optimization

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016

work page 2016

[19] [19]

Preference-based reinforcement learning: a formal framework and a policy iteration algorithm

Johannes F ¨urnkranz, Eyke H ¨ullermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012

work page 2012

[20] [20]

Exploring voting blocs within the irish electorate: A mixture modeling approach

Isobel Claire Gormley and Thomas Brendan Murphy. Exploring voting blocs within the irish electorate: A mixture modeling approach. Journal of the American Statistical Association, 103(483):1014–1027, 2008

work page 2008

[21] [21]

Bayesian inference for plackett-luce ranking models

John Guiver and Edward Snelson. Bayesian inference for plackett-luce ranking models. In proceedings of the 26th annual international conference on machine learning , pages 377–384. ACM, 2009

work page 2009

[22] [22]

Inverse reward design

Dylan Hadﬁeld-Menell, Smitha Milli, Pieter Abbeel, Stu- art J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems , pages 6765–6774, 2017

work page 2017

[23] [23]

Generative adversarial imitation learning

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016

work page 2016

[24] [24]

Active comparison based learning incorporating user uncertainty and noise

Rachel Holladay, Shervin Javdani, Anca Dragan, and Siddhartha Srinivasa. Active comparison based learning incorporating user uncertainty and noise. In RSS Work- shop on Model Learning for Human-Robot Communica- tion, 2016

work page 2016

[25] [25]

Reward learning from human preferences and demonstrations in atari

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. In Ad- vances in Neural Information Processing Systems , pages 8022–8034, 2018

work page 2018

[26] [26]

Learning preferences for manipulation tasks from online coactive feedback

Ashesh Jain, Shikhar Sharma, Thorsten Joachims, and Ashutosh Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research , 34(10):1296–1313, 2015

work page 2015

[27] [27]

Shared Autonomy via Hindsight Optimization

Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. arXiv preprint arXiv:1503.07619 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[28] [28]

URL https://twitter.com/mat kelcey/ status/886101319559335936

Mat Kelcey, 2017. URL https://twitter.com/mat kelcey/ status/886101319559335936

work page arXiv 2017

[29] [29]

Data-driven motion mappings improve transparency in teleoperation

Rebecca P Khurshid and Katherine J Kuchenbecker. Data-driven motion mappings improve transparency in teleoperation. Presence: Teleoperators and Virtual Envi- ronments, 24(2):132–154, 2015

work page 2015

[30] [30]

Continuous Inverse Optimal Control with Locally Optimal Examples

Sergey Levine and Vladlen Koltun. Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[31] [31]

Individual choice behavior: A theoret- ical analysis

R Duncan Luce. Individual choice behavior: A theoret- ical analysis. Courier Corporation, 2012

work page 2012

[32] [32]

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

Dhruv Malik, Malayandi Palaniappan, Jaime F Fisac, Dylan Hadﬁeld-Menell, Stuart Russell, and Anca D Dra- gan. An efﬁcient, generalized bellman update for co- operative inverse reinforcement learning. arXiv preprint arXiv:1806.03820, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[33] [33]

Conditional logit analysis of qualitative choice behavior

Daniel McFadden et al. Conditional logit analysis of qualitative choice behavior. 1973

work page 1973

[34] [34]

The analysis of permutations

Robin L Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975

work page 1975

[35] [35]

Bayesian inverse reinforcement learning

Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1–4, 2007

work page 2007

[36] [36]

Maximum margin planning

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning , pages 729–736. ACM, 2006

work page 2006

[37] [37]

Simplifying Reward Design through Divide-and-Conquer

Ellis Ratner, Dylan Hadﬁeld-Menell, and Anca D Dra- gan. Simplifying reward design through divide-and- conquer. arXiv preprint arXiv:1806.02501 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[38] [38]

A reduction of imitation learning and structured prediction to no-regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the four- teenth international conference on artiﬁcial intelligence and statistics, pages 627–635, 2011

work page 2011

[39] [39]

Active preference-based learning of reward functions

Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions

work page

[40] [40]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[41] [41]

Preference-learning based inverse reinforcement learning for dialog control

Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Mi- nami. Preference-learning based inverse reinforcement learning for dialog control. In Thirteenth Annual Con- ference of the International Speech Communication As- sociation, 2012

work page 2012

[42] [42]

Integrating reinforcement learning with human demonstrations of varying ability

Matthew E Taylor, Halit Bener Suay, and Sonia Cher- nova. Integrating reinforcement learning with human demonstrations of varying ability. In The 10th Interna- tional Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 617–624. International Foun- dation for Autonomous Agents and Multiagent Systems, 2011

work page 2011

[43] [43]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ Interna- tional Conference on , pages 5026–5033. IEEE, 2012

work page 2012

[44] [44]

A bayesian approach for policy learning from trajectory preference queries

Aaron Wilson, Alan Fern, and Prasad Tadepalli. A bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, pages 1133–1141, 2012

work page 2012

[45] [45]

Fetch and freight: Standard platforms for service robot applications

Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots , 2016

work page 2016

[46] [46]

Learning Reward Functions by Integrating Human Demonstrations and Preferences

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. APPENDIX This appendix complements the RSS 2019 paper, “Learning Reward Functions by Integrating Human Demonstrations and Preferences”. A. Supplemental Videos We have provided the fo...

work page 2008

[47] [47]

Demo.mov: This video shows a user teleoperating the robot using the keyboard interface

work page

[48] [48]

Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal

PrefT1.mov and PrefT2.mov: These two videos show a preference query (two trajectories) generated by our system. Note that this pair of trajectories is clearly querying the user for whether she wants the robot arm to move towards the goal or away from the goal. Additionally, note the jaggedness of the trajectory: this is due to the highly non-convex nature...

work page

[49] [49]

RolloutDemPref.mov: This video shows a sample trajectory generated by PPO, according to the reward function learned by DemPref (from a speciﬁc user). (In reality, the robot arm does get fairly close to the goal; we intentionally kept the table much lower when rolling-out behavior on the real robot to prevent collisions between the robot and the table. Use...

work page

[50] [50]

main”, “update func

RolloutIRL.mov: This video shows a sample trajec- tory generated by PPO, according to the reward function learned by IRL (from the same user as above). Note the extremely poor performance of the robot – this is discussed in Section VII. B. Code The repository for this project is provided at the following link https://github.com/malayandi/DemPrefCode. Depe...

work page

[51] [51]

As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments

Number of samples used in Monte Carlo approximation to objective in (6): 50,000 True Reward Function. As discussed in Section VI, we chose a “true” weight vector for each domain to use in our simulation experiments. We chose a weight vector that seemed reasonable in each domain. No tuning was performed. The weight vector for each domain is as follows:

work page

[52] [52]

Driver: [0.5, -0.2, 0.2, -0.7]

work page

[53] [53]

Lunar Lander: [-0.4, 0.4, -0.2, -0.7]

work page

[54] [54]

Fetch Reach: [-0.6, -0.3, 0.9] Any further experimental details not found here can be found in the provided code

work page