Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations
Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3
The pith
Robots recover misaligned rewards by detecting underspecified features via demonstration variation and soliciting targeted natural language corrections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Demonstrations implicitly reveal which features are well specified through low variation across examples, while high variation indicates underspecified features that create reward ambiguity. The robot leverages this signal to generate natural language explanations of its uncertainty and actively queries for corrective demonstrations that resolve those gaps, thereby reducing misalignment that would otherwise persist from imperfect initial data.
What carries the argument
Variation across demonstrations as a statistical signal for underspecified features, paired with natural language explanations to elicit targeted corrective demonstrations.
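The variation signal can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each demonstration has already been reduced to one scalar per feature, that features share a comparable scale, and that a fixed threshold separates low from high variation. The feature names, values, and threshold below are hypothetical.

```python
from statistics import pvariance


def feature_variances(demos):
    """Return the population variance of each feature across demonstrations.

    demos: list of demonstrations, each a list with one scalar per feature.
    """
    n_features = len(demos[0])
    return [pvariance([demo[i] for demo in demos]) for i in range(n_features)]


def flag_underspecified(demos, feature_names, threshold):
    """Return names of features whose variance across demos exceeds threshold."""
    variances = feature_variances(demos)
    return [name for name, var in zip(feature_names, variances) if var > threshold]


# Hypothetical data: three demonstrations, two features each.
demos = [
    [0.10, 0.90],
    [0.12, 0.20],
    [0.09, 0.55],
]
names = ["distance_to_table", "distance_to_laptop"]  # assumed feature names
flagged = flag_underspecified(demos, names, threshold=0.01)
print(flagged)
```

Here the first feature barely changes across demonstrations while the second swings widely, so only the second is flagged. A fixed threshold is the simplest possible choice; it is an assumption of this sketch.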
If this is right
- Reward functions learned this way exhibit reduced ambiguity on features that initially varied widely.
- Robot behavior aligns more closely with intended preferences in deployment situations not covered by the original demonstrations.
- The method yields measurable gains in both simulated manipulation domains and physical robot interactions with human users.
- Targeted queries outperform both random questioning and reliance on the initial imperfect demonstrations alone.
Where Pith is reading between the lines
- The variation signal could be combined with other uncertainty measures to prioritize queries even more efficiently.
- The approach may generalize to non-robotics settings where humans give demonstrations or feedback and language clarification is available.
- Fewer total demonstrations might suffice overall because queries focus human effort on high-variation features rather than spreading it evenly.
Load-bearing premise
Statistical variation across demonstrations reliably indicates which features humans left underspecified, and natural language queries will produce demonstrations that correctly resolve those gaps.
What would settle it
A controlled comparison in which explanation-guided queries produce no measurable improvement in reward alignment metrics over random querying or passive collection would falsify the claim.
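One way such a comparison could be scored is with an exact permutation test on per-run alignment scores. The sketch below is an assumption about how the check might be set up, not the paper's analysis: the scores are synthetic placeholders, and the one-sided difference in means is an illustrative choice of test statistic.

```python
from itertools import combinations


def exact_permutation_pvalue(a, b):
    """One-sided exact permutation test for mean(a) > mean(b).

    Enumerates every way of relabelling the pooled scores into groups of the
    original sizes and returns the fraction of relabellings whose difference
    in means is at least as large as the observed one.
    """
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = sum(a) / n_a - sum(b) / n_b
    count = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
        n_splits += 1
    return count / n_splits


# Synthetic placeholder scores, NOT results from the paper.
targeted = [0.90, 0.85, 0.88, 0.92, 0.87]
random_q = [0.60, 0.65, 0.58, 0.62, 0.66]
p_value = exact_permutation_pvalue(targeted, random_q)
print(p_value)
```

A p-value near 1 on real data would be the kind of outcome that counts against the claim; a small one would be consistent with it but would not by itself rule out the confounds discussed in the referee report.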
Original abstract
Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
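The explanation step described in the abstract can be illustrated with a simple template. This is a hedged sketch only: the abstract does not say how the natural-language explanation is generated, so the template wording, the function name `explain_uncertainty`, the feature names, and the threshold are all hypothetical.

```python
def explain_uncertainty(variances, threshold):
    """Build a plain-language query from per-feature variances.

    variances: dict mapping a feature name to its variance across demos.
    Features above the threshold are listed from most to least variable.
    """
    uncertain = sorted(
        (name for name, var in variances.items() if var > threshold),
        key=lambda name: variances[name],
        reverse=True,
    )
    if not uncertain:
        return "I am not uncertain about any feature."
    listed = ", ".join(uncertain)
    return (
        f"I am unsure how much you care about: {listed}. "
        f"Could you give a demonstration that makes your preference about "
        f"{uncertain[0]} clear?"
    )


# Hypothetical variances for three assumed features.
message = explain_uncertainty(
    {"distance to table": 0.0002, "distance to laptop": 0.08, "cup tilt": 0.03},
    threshold=0.01,
)
print(message)
```

The query names the most variable feature first, which mirrors the idea of directing human effort at the largest gap.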
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that reward functions learned from demonstrations often suffer from underspecified features due to imperfect human input, and proposes detecting these via statistical variation (low variation = well-specified, high variation = underspecified). The robot then generates natural-language explanations of its uncertainty and solicits targeted demonstrations to resolve the gaps. Empirical results in a simulated tabletop domain and a real Franka robot user study show that this targeted querying improves reward recovery over random querying and passive collection.
Significance. If the central result holds, the work provides a concrete mechanism for active resolution of ambiguity in inverse reinforcement learning from imperfect demonstrations, a practical issue in human-robot interaction. The real-robot user study and comparison to both random and passive baselines are strengths that ground the claim in usable settings. The natural-language query interface is a positive contribution to interpretability.
major comments (1)
- [Abstract and §3] Abstract (key insight paragraph) and §3 (method description of feature detection): the claim that variation across demonstrations reliably indicates underspecification (vs. sensor/execution noise, multiple equally optimal policies, or unrelated human suboptimality) is load-bearing for the query-selection step. No mechanism, control experiment, or statistical test is described to distinguish these cases, so the inferred uncertainty set and subsequent improvement may not be attributable to the proposed signal.
minor comments (1)
- [Experiments section] Figure captions and axis labels in the experimental results could more explicitly state the number of trials, statistical test used, and exact definition of 'reward recovery' metric for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below.
- Referee: [Abstract and §3] Abstract (key insight paragraph) and §3 (method description of feature detection): the claim that variation across demonstrations reliably indicates underspecification (vs. sensor/execution noise, multiple equally optimal policies, or unrelated human suboptimality) is load-bearing for the query-selection step. No mechanism, control experiment, or statistical test is described to distinguish these cases, so the inferred uncertainty set and subsequent improvement may not be attributable to the proposed signal.
  Authors: We agree that variation across demonstrations is not a unique indicator of underspecification and may also arise from sensor or execution noise, multiple equally optimal policies, or other forms of human suboptimality. The manuscript presents variation as a practical statistical proxy for identifying features that may require additional supervision, without providing an explicit mechanism, control experiment, or statistical test to isolate the underlying cause. The empirical improvements over random querying and passive collection are demonstrated in the tabletop domain and Franka user study, but these results do not rule out alternative explanations for the observed variation. In the revised manuscript we will add explicit discussion of this assumption and its limitations in §3, including a dedicated paragraph on potential confounding factors and their implications for the uncertainty set.
  Revision: yes
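The confound raised in this exchange can be made concrete with a noise-floor check. The sketch below is not something the manuscript or the rebuttal describes: it assumes a separate calibration set of repeated executions of one intended behavior is available to estimate execution noise, and the ratio cutoff of 4.0 is an arbitrary illustrative value.

```python
from statistics import variance


def exceeds_noise_floor(feature_values, noise_samples, ratio=4.0):
    """Return True if variation across demos clearly exceeds execution noise.

    feature_values: one feature's value in each demonstration.
    noise_samples:  the same feature measured over repeated executions of a
                    single intended behavior (hypothetical calibration data).
    ratio:          how many times larger the demo variance must be.
    """
    demo_var = variance(feature_values)
    noise_var = variance(noise_samples)
    if noise_var == 0:
        return demo_var > 0
    return demo_var / noise_var > ratio


# Hypothetical calibration samples and two hypothetical features.
noise = [0.50, 0.52, 0.48, 0.51, 0.49]
print(exceeds_noise_floor([0.9, 0.2, 0.55, 0.7, 0.3], noise))
print(exceeds_noise_floor([0.50, 0.53, 0.47, 0.51, 0.49], noise))
```

A check like this would separate execution noise from larger variation, but it would not distinguish underspecification from multiple equally optimal policies, which also produce variation well above the noise floor.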
Circularity Check
No circularity: central claim uses direct statistical signal from input demonstrations
The paper's key step infers underspecified features from variation across demonstrations as an empirical observation ('features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely'), without any quoted equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. The framework then uses this signal for natural-language queries and evaluates against random/passive baselines in simulation and user studies, remaining self-contained without load-bearing self-references or ansatz smuggling. This matches the most common honest finding of no circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Demonstrations implicitly reveal which features are well-specified through low variation across examples
- domain assumption: Natural language explanations can effectively elicit targeted corrective demonstrations from users
Appendix A. Implementation Details
Reference Variance Distributions. To construct the reference distributions $P(\sigma_i^2 \mid o_i = 0, \phi_i)$ and $P(\sigma_i^2 \mid o_i = 1, \phi_i)$ for each feature $\phi_i \in \phi$,...
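The appendix text is cut off before it says how the two reference distributions are used. One natural reading, which the truncated source does not confirm, is that they serve as class-conditional likelihoods in Bayes' rule, giving a posterior over the indicator $o_i$ once the variance $\sigma_i^2$ of feature $\phi_i$ has been observed. The prior $P(o_i)$ is an assumption of this sketch and is taken to be independent of $\phi_i$; the source excerpt does not state which value of $o_i$ corresponds to a well-specified feature.

```latex
$$
P(o_i = 1 \mid \sigma_i^2, \phi_i)
  = \frac{P(\sigma_i^2 \mid o_i = 1, \phi_i)\, P(o_i = 1)}
         {\sum_{o \in \{0, 1\}} P(\sigma_i^2 \mid o_i = o, \phi_i)\, P(o_i = o)}
$$
```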