On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

Anca D. Dragan; Noah Gundotra; Pieter Abbeel; Rohin Shah

arXiv: 1906.09624 · v1 · pith:L2WVJCMFnew · submitted 2019-06-23 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 17:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords inverse reinforcement learning · reward inference · human biases · differentiable planner · learning from demonstrations · planning approximation · IRL

The pith

Learning a demonstrator's planning process improves reward inference over wrong bias assumptions, but the switch from exact to differentiable planning hurts more than the gain helps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether replacing explicit assumptions about human biases in inverse reinforcement learning with a data-driven learned planner yields better reward functions. It trains a differentiable planner on demonstrations to capture whatever planning process the demonstrator actually uses, then compares the resulting reward inference quality against baselines that assume noisily optimal behavior or a specific wrong bias such as risk aversion. Results show a modest improvement from avoiding the wrong assumption, yet this gain is smaller than the degradation introduced by the planner's approximation error relative to an exact planner. The work concludes that reward inference currently needs an intermediate approach that retains useful bias structure while adding data-driven flexibility.

Core claim

Rather than assuming the expert is noisily optimal or has a known bias such as risk-aversion, the method learns the demonstrator's planning algorithm itself as a differentiable planner from the provided demonstrations. Experiments demonstrate that this learned planner produces better reward inference than an incorrect fixed bias assumption, but the performance loss incurred by replacing an exact planner with the learned differentiable version is substantially larger than the benefit obtained from avoiding the wrong assumption.
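
To make the contrast between an optimal and a biased demonstrator concrete, here is a minimal sketch; it is not the paper's implementation. It assumes a hypothetical five-state line world, models myopia as value iteration with a short horizon, and models noisy optimality as a Boltzmann (softmax) choice over Q-values. The names `q_values`, `boltzmann_policy`, and `REWARD`, and all numeric settings, are illustrative assumptions.

```python
import math

LEFT, RIGHT = 0, 1


def q_values(reward, horizon, gamma=0.9):
    """Finite-horizon value iteration on a deterministic line world.

    reward[s] is received on entering state s; moving past either end
    keeps the agent in place. Returns Q[s][a] after `horizon` backups.
    """
    n = len(reward)
    v = [0.0] * n
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(horizon):
        # The tuple order (left neighbour, right neighbour) matches LEFT, RIGHT.
        q = [[reward[nxt] + gamma * v[nxt]
              for nxt in (max(s - 1, 0), min(s + 1, n - 1))]
             for s in range(n)]
        v = [max(row) for row in q]
    return q


def boltzmann_policy(q_row, beta=1.0):
    """Softmax action probabilities; larger beta means less noise."""
    scaled = [beta * x for x in q_row]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical rewards: a small prize next to the start, a large one far away.
REWARD = [1, 0, 0, 0, 10]
START = 1

far_sighted = q_values(REWARD, horizon=50)  # stand-in for an unbiased planner
myopic = q_values(REWARD, horizon=2)        # assumed model of myopia

print("far-sighted prefers RIGHT:",
      far_sighted[START][RIGHT] > far_sighted[START][LEFT])
print("myopic prefers LEFT:",
      myopic[START][LEFT] > myopic[START][RIGHT])
print("noisy myopic action probabilities:", boltzmann_policy(myopic[START]))
```

In this toy setting the bias lives entirely in how the Q-values are computed, while the noise lives in how actions are chosen from them, so the two can be varied independently.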

What carries the argument

A differentiable planner trained on demonstrations to stand in for the demonstrator's actual planning process inside the reward inference objective.
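
How a planner sits inside the reward inference objective can be sketched as follows. This is an illustration under stated assumptions: a hypothetical line world, a small hand-picked set of candidate rewards searched exhaustively, and finite-horizon value iteration standing in for the planner. The paper instead learns a differentiable planner by gradient-based training; none of the names below come from its code.

```python
import math

LEFT, RIGHT = 0, 1


def plan(reward, horizon, gamma=0.9):
    """Stand-in planner: finite-horizon value iteration, returns Q[s][a]."""
    n = len(reward)
    v = [0.0] * n
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(horizon):
        q = [[reward[nxt] + gamma * v[nxt]
              for nxt in (max(s - 1, 0), min(s + 1, n - 1))]
             for s in range(n)]
        v = [max(row) for row in q]
    return q


def log_likelihood(demos, q, beta=1.0):
    """Log-probability of (state, action) pairs under a softmax policy."""
    total = 0.0
    for s, a in demos:
        scaled = [beta * x for x in q[s]]
        top = max(scaled)
        log_norm = top + math.log(sum(math.exp(x - top) for x in scaled))
        total += scaled[a] - log_norm
    return total


def infer_reward(demos, candidates, horizon):
    """Pick the candidate reward that best explains the demonstrations,
    given an assumed planning horizon for the demonstrator."""
    return max(candidates,
               key=lambda r: log_likelihood(demos, plan(r, horizon)))


# Hypothetical demonstrator: plans only two steps ahead on the true reward.
TRUE_REWARD = (1, 0, 0, 0, 10)
demo_q = plan(TRUE_REWARD, horizon=2)
DEMOS = [(s, LEFT if demo_q[s][LEFT] >= demo_q[s][RIGHT] else RIGHT)
         for s in range(len(TRUE_REWARD))]

CANDIDATES = [(1, 0, 0, 0, 10), (10, 0, 0, 0, 1), (10, 0, 0, 0, 10)]

matched = infer_reward(DEMOS, CANDIDATES, horizon=2)      # correct planner model
mismatched = infer_reward(DEMOS, CANDIDATES, horizon=50)  # wrong planner model
print("matched planner infers:", matched)
print("mismatched planner infers:", mismatched)
```

With these toy numbers, the matched planner model recovers the true reward, while assuming a far-sighted planner selects `(10, 0, 0, 0, 10)`: it explains the demonstrator's short-sighted move toward the nearby prize by inflating that prize's value.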

If this is right

Reward inference accuracy rises when the learned planner matches the demonstrator's actual process better than a mismatched explicit bias does.
An exact planner paired with the correct bias assumption produces the lowest inference error.
The approximation gap between exact and differentiable planners exceeds the penalty from using an incorrect bias.
Reward inference systems benefit from retaining some explicit structure rather than moving to fully learned planners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid planners that embed limited learnable parameters inside an otherwise exact planning algorithm could reduce approximation error while still adapting to observed behavior.
Evaluating the same tradeoff on real human demonstrations rather than simulated biased planners would test whether the reported gap persists outside controlled settings.
Improving the training stability or capacity of differentiable planners might narrow the performance difference to exact planners enough to make bias learning advantageous.

Load-bearing premise

The error introduced by training and using a differentiable planner remains small enough that it does not overwhelm any improvement gained by avoiding a mismatched bias assumption.

What would settle it

An experiment in which reward inference error with the learned differentiable planner falls below the error obtained with an exact planner that uses the correct bias assumption would falsify the dominance of the approximation loss.

Figures

Figures reproduced from arXiv: 1906.09624 by Anca D. Dragan, Noah Gundotra, Pieter Abbeel, Rohin Shah.

**Figure 1.** While we could correct for systematic biases by having our AI system reason about explicit models of human reasoning, using the wrong assumption can lead to agents that do not correctly understand what people want. A natural alternative is to learn human biases from data. Our goal in this work is to investigate this alternative and gain insight into what additional assumptions might make it feasible and wh…

**Figure 2.** The plans of our synthetic agents on two navigation environments. Actual trajectories could differ due to randomness in the transitions. Green squares indicate positive reward while red squares indicate negative reward, with darker colors indicating higher magnitude of reward. sophisticated hyperbolic time discounters can be "tempted" by a proximate smaller reward. The naive agent fails to anticipate the …

**Figure 3.** The architecture and operations on it used in the algorithms. training jointly as in Equation TRAIN-JOINTLY. Since the reward inference requires a differentiable planner, we need a method that sets a differentiable planner to be optimal (that is, the planner that maximizes expected reward). This can be done by simulating data from an optimal agent with randomly generated world models and rewards, and use …

**Figure 4.** Reward obtained when planning with the inferred reward, as a percentage of the maximum possible reward, for different bias models and algorithms. We implement five types of biases by modifying the value iteration algorithm to produce a different set of Q-values. The top row shows results for agents that choose between the best actions from these Q-values, while in the bottom row the agent chooses actions w…

**Figure 5.** Percent reward obtained for different bias models using variations of Algorithm 2, which does not get access to any known rewards. These algorithms can vary along two dimensions – whether they are initialized with the assumption that the demonstrator is rational, and whether they train the planner and reward jointly or with coordinate ascent. The original version of Algorithm 2 does initialize, and trains …
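
The evaluation metric named in the Figure 4 caption (reward obtained when planning with the inferred reward, as a percentage of the maximum possible reward) can be sketched as below. This is a hedged illustration, not the paper's evaluation code: the deterministic line world, the greedy planner, the rollout length, and the names `greedy_policy`, `rollout_return`, and `percent_of_max` are all assumptions, and the sketch presumes the optimal return is positive.

```python
LEFT, RIGHT = 0, 1


def greedy_policy(reward, horizon=50, gamma=0.9):
    """Plan with value iteration on a line world; return one action per state."""
    n = len(reward)
    v = [0.0] * n
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(horizon):
        q = [[reward[nxt] + gamma * v[nxt]
              for nxt in (max(s - 1, 0), min(s + 1, n - 1))]
             for s in range(n)]
        v = [max(row) for row in q]
    return [LEFT if row[LEFT] >= row[RIGHT] else RIGHT for row in q]


def rollout_return(policy, true_reward, start, steps=50, gamma=0.9):
    """Discounted return of `policy`, scored with the TRUE reward."""
    n = len(true_reward)
    state, total, discount = start, 0.0, 1.0
    for _ in range(steps):
        state = max(state - 1, 0) if policy[state] == LEFT else min(state + 1, n - 1)
        total += discount * true_reward[state]
        discount *= gamma
    return total


def percent_of_max(inferred_reward, true_reward, start):
    """Plan with the inferred reward, evaluate on the true one, and
    normalise by what planning with the true reward would achieve."""
    achieved = rollout_return(greedy_policy(inferred_reward), true_reward, start)
    best = rollout_return(greedy_policy(true_reward), true_reward, start)
    return 100.0 * achieved / best


TRUE_REWARD = [1, 0, 0, 0, 10]
print(percent_of_max(TRUE_REWARD, TRUE_REWARD, start=1))       # perfect inference
print(percent_of_max([10, 0, 0, 0, 1], TRUE_REWARD, start=1))  # badly wrong inference
```

Scoring the policy rather than the reward vector itself means two inferred rewards that induce the same behavior receive the same score, which is one plausible reason to prefer this kind of metric over a direct distance between rewards.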

Original abstract

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning. But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach. We decided to put this to the test -- rather than relying on assumptions about which specific bias the demonstrator has when planning, we instead learn the demonstrator's planning algorithm that they use to generate demonstrations, as a differentiable planner. Our exploration yielded mixed findings: on the one hand, learning the planner can lead to better reward inference than relying on the wrong assumption; on the other hand, this benefit is dwarfed by the loss we incur by going from an exact to a differentiable planner. This suggests that at least for the foreseeable future, agents need a middle ground between the flexibility of data-driven methods and the useful bias of known human biases. Code is available at https://tinyurl.com/learningbiases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Learning the planner beats a mismatched bias assumption in IRL, but the switch to a differentiable planner costs more than it gains.

The core result is that learning a differentiable planner from data improves reward inference over using the wrong fixed bias assumption, yet the performance drop from moving away from an exact planner is larger than that gain. The paper puts the data-driven suggestion to a direct test instead of leaving it as an unexamined hope. The authors run the comparison across environments and report the quantified trade-off, which is the genuinely new contribution. Releasing the code is also useful for anyone who wants to check or extend the setup. The experiments appear to back the mixed conclusion without forcing a stronger claim than the numbers support. The main limitation is that the net effect is still negative once the approximation error is accounted for, so the practical payoff for IRL work remains narrow. The work is aimed at researchers already doing inverse reinforcement learning or reward modeling from human demonstrations. It is a solid, contained empirical check rather than a broad theoretical advance, but the question it asks is relevant enough to deserve referee time to verify the details and to see whether the environments and baselines hold up.

Referee Report

0 major / 2 minor

Summary. The paper investigates the feasibility of learning a differentiable planner from demonstrations as an alternative to explicitly assuming specific human biases (such as risk aversion or myopia) when performing reward inference via inverse reinforcement learning. Through empirical comparisons in multiple environments, it reports that learning the planner can outperform reward inference based on an incorrect bias assumption, but that the performance degradation from replacing an exact planner with a differentiable approximation is substantially larger than this benefit, leading to the conclusion that a middle ground between data-driven flexibility and known bias models is needed. Code is released.

Significance. If the reported comparisons hold under the chosen architectures and environments, the work supplies concrete empirical evidence on the relative costs of model mismatch versus approximation error in IRL, supporting the value of hybrid approaches. The release of code is a positive factor for reproducibility and further investigation of the trade-off.

minor comments (2)

§4 (Experiments): The quantitative results comparing the learned planner with a mismatched bias would benefit from explicit reporting of statistical significance or error bars across runs to strengthen the mixed-finding claim.
The environments and planner architectures are described, but a table summarizing all hyperparameter choices and baseline implementations would improve clarity for replication.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly captures the paper's core empirical finding that learning a differentiable planner can improve reward inference relative to an incorrect explicit bias assumption, yet the performance cost of replacing an exact planner with a differentiable approximation is substantially larger.

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical comparison of reward inference approaches (learning a differentiable planner from data versus assuming specific human biases like risk-aversion or myopia). No load-bearing derivations, predictions, or uniqueness theorems are present that reduce to fitted inputs or self-citations by construction. The mixed findings are directly supported by experiments and released code, with the central claim being internally consistent with the stated goal of testing a data-driven alternative. This is the most common honest finding for purely empirical work without theoretical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract. The approach relies on standard IRL assumptions and differentiable optimization from prior literature.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mitigating Cognitive Bias in RLHF by Altering Rationality
cs.AI · 2026-05 · unverdicted · novelty 6.0

Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.

