
arxiv: 2605.22986 · v1 · pith:IR72YOXHnew · submitted 2026-05-21 · 💻 cs.RO · cs.AI · cs.HC · cs.LG

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.HC · cs.LG
keywords reward learning from demonstrations · active querying · natural language explanations · underspecified features · robot alignment · targeted corrections · imperfect demonstrations

The pith

Robots recover misaligned rewards by detecting underspecified features via demonstration variation and soliciting targeted natural language corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that robots can identify which task features remain underspecified in human demonstrations by measuring statistical variation across those demonstrations. Low variation signals features the human consistently optimized; high variation flags gaps that leave the reward function ambiguous. The robot then describes its uncertainty in natural language and requests new demonstrations that explicitly address the identified features. Evaluations in a simulated tabletop manipulation task and a real-robot user study show these guided queries recover more accurate rewards than random queries or passive data collection alone.

Core claim

Demonstrations implicitly reveal which features are well specified through low variation across examples, while high variation indicates underspecified features that create reward ambiguity. The robot leverages this signal to generate natural language explanations of its uncertainty and actively queries for corrective demonstrations that resolve those gaps, thereby reducing misalignment that would otherwise persist from imperfect initial data.
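
The variance signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `flag_underspecified`, the fixed `threshold`, and the toy feature values are all assumptions, and the features are assumed to be on comparable scales so that one threshold is meaningful.

```python
import numpy as np

def flag_underspecified(features, threshold):
    """Flag features whose values vary widely across demonstrations.

    features:  array of shape (n_demos, n_features); each row holds one
               demonstration's summary value for every feature.
    threshold: hypothetical variance cutoff (an assumption of this sketch).
    Returns the per-feature sample variance and a boolean mask.
    """
    features = np.asarray(features, dtype=float)
    variances = features.var(axis=0, ddof=1)  # sample variance per feature
    return variances, variances > threshold

# Toy values, made up for illustration only.
demos = np.array([
    [0.10, 0.90],
    [0.12, 0.20],
    [0.09, 0.55],
])
variances, flagged = flag_underspecified(demos, threshold=0.05)
print(variances, flagged)
```

In this toy input the first feature is nearly constant across demonstrations and the second is spread out, so only the second is flagged. A fixed cutoff is the simplest possible decision rule and is used here only to keep the sketch short.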

What carries the argument

Variation across demonstrations as a statistical signal for underspecified features, paired with natural language explanations to elicit targeted corrective demonstrations.
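
How a flagged feature set might be turned into a natural language request can be sketched with a plain string template. The wording, the function name `build_query`, and the feature names in the usage line are hypothetical; the paper's own explanation generator is not described on this page.

```python
def build_query(feature_names, flagged):
    """Turn a boolean mask of flagged features into a plain-English request.

    feature_names: human-readable names, one per feature (hypothetical).
    flagged:       booleans, True where the feature looks underspecified.
    """
    uncertain = [name for name, is_flagged in zip(feature_names, flagged) if is_flagged]
    if not uncertain:
        return "Your demonstrations were consistent on every feature I track."
    if len(uncertain) <= 2:
        listed = " and ".join(uncertain)
    else:
        listed = ", ".join(uncertain[:-1]) + ", and " + uncertain[-1]
    return (
        f"I am unsure how much you care about {listed}. "
        f"Could you give another demonstration that makes your preference about {listed} clear?"
    )

# Hypothetical feature names, for illustration only.
print(build_query(["distance to the table", "distance to the laptop"], [False, True]))
```

A template keeps the request tied to exactly the features that were flagged, which makes the sketch easy to audit. A deployed system could substitute any text generator behind the same interface.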

If this is right

  • Reward functions learned this way exhibit reduced ambiguity on features that initially varied widely.
  • Robot behavior aligns more closely with intended preferences in deployment situations not covered by the original demonstrations.
  • The method yields measurable gains in both simulated manipulation domains and physical robot interactions with human users.
  • Targeted queries outperform both random questioning and reliance on the initial imperfect demonstrations alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variation signal could be combined with other uncertainty measures to prioritize queries even more efficiently.
  • The approach may generalize to non-robotics settings where humans give demonstrations or feedback and language clarification is available.
  • Fewer total demonstrations might suffice overall because queries focus human effort on high-variation features rather than spreading it evenly.

Load-bearing premise

Statistical variation across demonstrations reliably indicates which features humans left underspecified, and natural language queries will produce demonstrations that correctly resolve those gaps.

What would settle it

A controlled comparison in which explanation-guided queries produce no measurable improvement in reward alignment metrics over random querying or passive collection would falsify the claim.
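
One way to run such a comparison is a one-sided permutation test on the difference in mean reward-alignment scores between two query strategies. The sketch below is an assumption about how the test could be set up, not the analysis the paper reports; `permutation_test` and its parameters are hypothetical names, and it treats the two groups as independent samples.

```python
import numpy as np

def permutation_test(targeted, baseline, n_resamples=10_000, seed=0):
    """One-sided permutation test for mean(targeted) > mean(baseline).

    targeted, baseline: 1-D sequences of alignment scores, one per run.
    Returns the observed difference in means and an estimated p-value.
    """
    rng = np.random.default_rng(seed)
    targeted = np.asarray(targeted, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    observed = targeted.mean() - baseline.mean()
    pooled = np.concatenate([targeted, baseline])
    n = len(targeted)
    at_least_as_large = 0
    for _ in range(n_resamples):
        shuffled = rng.permutation(pooled)  # random relabeling of the runs
        diff = shuffled[:n].mean() - shuffled[n:].mean()
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly positive.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value
```

A large p-value from a test of this kind, on an adequately powered study, is the sort of outcome that would count against the claim. For a within-participant design, a paired or mixed-effects analysis would be the more appropriate choice.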

Figures

Figures reproduced from arXiv: 2605.22986 by Andreea Bobu, Helena Merker, Nick Walker.

Figure 1: Humans don’t or can’t always attend to all important task features
Figure 2: Experimental environments. (a) JacoRobot simulated environment,
Figure 3: Reward recovery in JacoRobot when the initial demonstrations underspecify (a) a single feature or (b) two features. ASQ consistently outperforms the random baselines and approaches Oracle performance, recovering the true reward with fewer targeted demonstrations. Lines show mean normalized reward across 5 seeds and shaded regions denote standard error.
Figure 4: Reward recovery in JacoRobot in the 8-feature setting that augments the feature set of 4 task-relevant features with 4 irrelevant distractor objects, with (a) one task-relevant feature underspecified and (b) two task-relevant features underspecified. LLM-based filtering prunes the distractors before variance-based detection, allowing ASQ to focus queries on task-relevant features. ASQ matches or exceeds th…
Figure 5: Change in normalized reward by condition. Hashed indicates an
Figure 6: Whether participants reported emphasizing the underspecified feature
Figure 7: GridRobot navigation simulated environment. Along with the JacoRobot environment, we evaluate our approach in the GridRobot environment: a simulated discrete 2D navigation task. In the GridRobot domain, an agent must navigate from a start position to a goal position on a 5x5 grid while avoiding an obstacle (
Figure 8: GridRobot where the initial demonstrations have one underspecified
Figure 9: Per-participant demonstration feature values across the three user-study conditions. Each column shows one task-relevant feature; rows correspond to
Figure 10: End-effector trajectories from all user study demonstrations. Each panel shows the three demonstrations provided by one participant (rows, P1-P12)
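
The Figure 3 caption describes lines as the mean normalized reward across seeds with standard-error bands. A minimal sketch of that aggregation follows; the array layout and the function name `mean_and_standard_error` are assumptions, and how reward is normalized is not specified on this page, so the sketch takes already-normalized values as input.

```python
import numpy as np

def mean_and_standard_error(curves):
    """Aggregate per-seed learning curves.

    curves: array of shape (n_seeds, n_queries); entry [s, q] is the
            normalized reward for seed s after q queries (assumed layout).
    Returns the per-query mean and standard error of the mean.
    """
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(number of seeds).
    sem = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, sem
```

With only 5 seeds the standard error is itself a noisy estimate, so bands computed this way are best read as a rough guide to spread rather than as a formal interval.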
original abstract

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that reward functions learned from demonstrations often suffer from underspecified features due to imperfect human input, and proposes detecting these via statistical variation (low variation = well-specified, high variation = underspecified). The robot then generates natural-language explanations of its uncertainty and solicits targeted demonstrations to resolve the gaps. Empirical results in a simulated tabletop domain and a real Franka robot user study show that this targeted querying improves reward recovery over random querying and passive collection.

Significance. If the central result holds, the work provides a concrete mechanism for active resolution of ambiguity in inverse reinforcement learning from imperfect demonstrations, a practical issue in human-robot interaction. The real-robot user study and comparison to both random and passive baselines are strengths that ground the claim in usable settings. The natural-language query interface is a positive contribution to interpretability.

major comments (1)
  1. [Abstract and §3] Abstract (key insight paragraph) and §3 (method description of feature detection): the claim that variation across demonstrations reliably indicates underspecification (vs. sensor/execution noise, multiple equally optimal policies, or unrelated human suboptimality) is load-bearing for the query-selection step. No mechanism, control experiment, or statistical test is described to distinguish these cases, so the inferred uncertainty set and subsequent improvement may not be attributable to the proposed signal.
minor comments (1)
  1. [Experiments section] Figure captions and axis labels in the experimental results could state more explicitly the number of trials, the statistical test used, and the exact definition of the 'reward recovery' metric, for reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract (key insight paragraph) and §3 (method description of feature detection): the claim that variation across demonstrations reliably indicates underspecification (vs. sensor/execution noise, multiple equally optimal policies, or unrelated human suboptimality) is load-bearing for the query-selection step. No mechanism, control experiment, or statistical test is described to distinguish these cases, so the inferred uncertainty set and subsequent improvement may not be attributable to the proposed signal.

    Authors: We agree that variation across demonstrations is not a unique indicator of underspecification and may also arise from sensor or execution noise, multiple equally optimal policies, or other forms of human suboptimality. The manuscript presents variation as a practical statistical proxy for identifying features that may require additional supervision, without providing an explicit mechanism, control experiment, or statistical test to isolate the underlying cause. The empirical improvements over random querying and passive collection are demonstrated in the tabletop domain and Franka user study, but these results do not rule out alternative explanations for the observed variation. In the revised manuscript we will add explicit discussion of this assumption and its limitations in §3, including a dedicated paragraph on potential confounding factors and their implications for the uncertainty set.

    revision: yes

Circularity Check

0 steps flagged

No circularity: central claim uses direct statistical signal from input demonstrations

full rationale

The paper's key step infers underspecified features from variation across demonstrations as an empirical observation ('features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely'), without any quoted equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. The framework then uses this signal for natural-language queries and evaluates against random/passive baselines in simulation and user studies, remaining self-contained without load-bearing self-references or ansatz smuggling. This matches the most common honest finding of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the domain assumption that variation across demonstrations is a valid proxy for underspecification and that language-based queries will produce corrective data; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Demonstrations implicitly reveal which features are well-specified through low variation across examples
    Stated as the key insight enabling detection of underspecified features
  • domain assumption: Natural language explanations can effectively elicit targeted corrective demonstrations from users
    Required for the querying component of the framework to function

pith-pipeline@v0.9.0 · 5745 in / 1144 out tokens · 26662 ms · 2026-05-25T05:34:55.139014+00:00 · methodology


APPENDIX A. Implementation Details

Reference Variance Distributions. To construct the reference distributions P(σ_i² | o_i = 0, ϕ_i) and P(σ_i² | o_i = 1, ϕ_i) for each feature ϕ_i ∈ ϕ,...
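
The excerpt above names two reference distributions over a feature's variance, conditioned on a binary indicator o_i. One plausible use, sketched below, is a Bayes-rule update that converts an observed variance into a posterior over o_i. Everything here is an assumption: the excerpt does not say what o_i = 1 means (this sketch takes it to mark a feature the demonstrator optimized), how the reference distributions are built (this sketch fits a log-normal to sample variances), or what prior is used.

```python
import math

def fit_log_normal(variance_samples):
    """Fit mean and std of log-variance (log-normal model, an assumption)."""
    logs = [math.log(v) for v in variance_samples]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / (len(logs) - 1)
    return mu, math.sqrt(var)

def log_normal_logpdf(v, mu, sigma):
    """Log density of a log-normal distribution at v > 0."""
    z = (math.log(v) - mu) / sigma
    return -0.5 * z * z - math.log(v * sigma * math.sqrt(2.0 * math.pi))

def posterior_optimized(observed_variance, ref_not_optimized, ref_optimized,
                        prior_optimized=0.5):
    """P(o = 1 | observed variance) under the two fitted reference models.

    ref_not_optimized / ref_optimized: positive sample variances standing in
    for P(sigma^2 | o = 0) and P(sigma^2 | o = 1) (hypothetical inputs).
    """
    mu0, s0 = fit_log_normal(ref_not_optimized)
    mu1, s1 = fit_log_normal(ref_optimized)
    ll0 = log_normal_logpdf(observed_variance, mu0, s0)
    ll1 = log_normal_logpdf(observed_variance, mu1, s1)
    # Posterior log-odds, then a numerically stable sigmoid.
    logit = math.log(prior_optimized / (1.0 - prior_optimized)) + ll1 - ll0
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

Under this reading, a feature would be a candidate for a targeted query when its posterior of having been optimized is low. Working in log space avoids dividing by a vanishing evidence term when the observed variance sits far in the tail of both reference models.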