arXiv: 1906.09624 · v1 · pith:L2WVJCMFnew · submitted 2019-06-23 · 💻 cs.LG · cs.AI · stat.ML

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

Pith reviewed 2026-05-25 17:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords inverse reinforcement learning · reward inference · human biases · differentiable planner · learning from demonstrations · planning approximation · IRL

The pith

Learning a demonstrator's planning process improves reward inference over wrong bias assumptions, but the switch from exact to differentiable planning hurts more than the gain helps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether replacing explicit assumptions about human biases in inverse reinforcement learning with a data-driven learned planner yields better reward functions. It trains a differentiable planner on demonstrations to capture whatever planning process the demonstrator actually uses, then compares the resulting reward inference quality against baselines that assume noisily optimal behavior or a specific wrong bias such as risk aversion. Results show a modest improvement from avoiding the wrong assumption, yet this gain is smaller than the degradation introduced by the planner's approximation error relative to an exact planner. The work concludes that reward inference currently needs an intermediate approach that retains useful bias structure while adding data-driven flexibility.

Core claim

Rather than assuming the expert is noisily optimal or has a known bias such as risk-aversion, the method learns the demonstrator's planning algorithm itself as a differentiable planner from the provided demonstrations. Experiments demonstrate that this learned planner produces better reward inference than an incorrect fixed bias assumption, but the performance loss incurred by replacing an exact planner with the learned differentiable version is substantially larger than the benefit obtained from avoiding the wrong assumption.

What carries the argument

A differentiable planner trained on demonstrations to stand in for the demonstrator's actual planning process inside the reward inference objective.
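
A minimal sketch of what it means for a planner to sit inside the reward inference objective, assuming a tabular MDP. It uses exact value iteration with a Boltzmann (noisily optimal) policy as a stand-in planner; the approach described above would replace that planner with a learned differentiable one. The function names, the three-state chain, and the enumeration over candidate rewards are illustrative assumptions, not the authors' code.

```python
import math

def q_values(T, R, gamma=0.9, iters=200):
    """Exact tabular value iteration; T[s][a] lists (next_state, prob) pairs."""
    V = [0.0] * len(T)
    Q = []
    for _ in range(iters):
        Q = [[R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a])
              for a in range(len(T[s]))] for s in range(len(T))]
        V = [max(q) for q in Q]
    return Q

def log_policy(Q, beta=5.0):
    """Log of a Boltzmann policy: log P(a|s) = beta*Q[s][a] - logsumexp(beta*Q[s])."""
    out = []
    for q in Q:
        m = max(q)
        lse = beta * m + math.log(sum(math.exp(beta * (x - m)) for x in q))
        out.append([beta * x - lse for x in q])
    return out

def demo_log_likelihood(T, R, demos):
    """Score a candidate reward by how well the planner explains the demos."""
    logp = log_policy(q_values(T, R))
    return sum(logp[s][a] for s, a in demos)

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
T = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(0, 1.0)], [(2, 1.0)]],
    [[(1, 1.0)], [(2, 1.0)]],
]
demos = [(2, 0), (1, 0), (0, 0)]  # the demonstrator always moves left

# Candidate rewards: a unit reward placed at exactly one state.
candidates = {goal: [1.0 if s == goal else 0.0 for s in range(3)]
              for goal in range(3)}
scores = {goal: demo_log_likelihood(T, R, demos) for goal, R in candidates.items()}
best_goal = max(scores, key=scores.get)
print(best_goal, scores)
```

In this toy case the candidate with reward at state 0 explains the leftward demonstrations best. A gradient-based method would optimize the reward directly rather than enumerate candidates, which is why the planner needs to be differentiable.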

If this is right

  • Reward inference accuracy rises when the learned planner matches the demonstrator's actual process better than a mismatched explicit bias does.
  • An exact planner paired with the correct bias assumption produces the lowest inference error.
  • The approximation gap between exact and differentiable planners exceeds the penalty from using an incorrect bias.
  • Reward inference systems benefit from retaining some explicit structure rather than moving to fully learned planners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid planners that embed limited learnable parameters inside an otherwise exact planning algorithm could reduce approximation error while still adapting to observed behavior.
  • Evaluating the same tradeoff on real human demonstrations rather than simulated biased planners would test whether the reported gap persists outside controlled settings.
  • Improving the training stability or capacity of differentiable planners might narrow the performance difference to exact planners enough to make bias learning advantageous.
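
As a hedged illustration of the first point above, one could keep an exact planner and fit only a single parameter to the demonstrations. The sketch below assumes a known reward on a five-state chain, treats the discount factor as the only learnable quantity, and selects it by grid search over a Boltzmann likelihood. The chain, the grid, and all names are hypothetical and do not come from the paper.

```python
import math

def chain_q_values(rewards, gamma, iters=200):
    """Exact value iteration on a 1-D chain; action 0 = left, action 1 = right."""
    n = len(rewards)
    V = [0.0] * n
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(iters):
        Q = [[rewards[s] + gamma * V[max(s - 1, 0)],
              rewards[s] + gamma * V[min(s + 1, n - 1)]] for s in range(n)]
        V = [max(q) for q in Q]
    return Q

def log_likelihood(rewards, gamma, demos, beta=5.0):
    """Log-probability of (state, action) demos under a Boltzmann policy."""
    Q = chain_q_values(rewards, gamma)
    total = 0.0
    for s, a in demos:
        scaled = [beta * q for q in Q[s]]
        m = max(scaled)
        lse = m + math.log(sum(math.exp(x - m) for x in scaled))
        total += scaled[a] - lse
    return total

# Small reward nearby on the left, large reward far away on the right.
rewards = [1.0, 0.0, 0.0, 0.0, 10.0]
# Hypothetical demonstrations: the agent heads for the nearby small reward.
demos = [(1, 0), (0, 0)]

# The only "learned" part of the planner is the discount factor.
gamma_grid = [0.1, 0.5, 0.9]
fits = {g: log_likelihood(rewards, g, demos) for g in gamma_grid}
best_gamma = max(fits, key=fits.get)
print(best_gamma, fits)
```

The planner stays exact for every candidate value, so any mismatch with the demonstrator comes from the one-parameter bias family rather than from approximation error in the planner.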

Load-bearing premise

The error introduced by training and using a differentiable planner remains small enough that it does not overwhelm any improvement gained by avoiding a mismatched bias assumption.

What would settle it

An experiment in which reward inference error with the learned differentiable planner falls below the error obtained with an exact planner that uses the correct bias assumption would falsify the dominance of the approximation loss.
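
One way such a comparison could be scored follows the metric named in the Figure 4 caption: reward obtained when planning with the inferred reward, as a percentage of the maximum possible reward. The sketch below is one plausible reading of that metric for a tabular MDP with a fixed start state. The helper names and the toy chain are assumptions, and the true optimal value is assumed to be positive.

```python
def q_values(T, R, gamma=0.9, iters=200):
    """Exact tabular value iteration; T[s][a] lists (next_state, prob) pairs."""
    V = [0.0] * len(T)
    Q = []
    for _ in range(iters):
        Q = [[R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a])
              for a in range(len(T[s]))] for s in range(len(T))]
        V = [max(q) for q in Q]
    return Q

def greedy(Q):
    """Pick the highest-valued action in each state (first index wins ties)."""
    return [max(range(len(q)), key=lambda a: q[a]) for q in Q]

def policy_value(T, R, policy, gamma=0.9, iters=200):
    """Iterative policy evaluation of a deterministic policy under reward R."""
    V = [0.0] * len(T)
    for _ in range(iters):
        V = [R[s] + gamma * sum(p * V[s2] for s2, p in T[s][policy[s]])
             for s in range(len(T))]
    return V

def percent_of_max_reward(T, true_R, inferred_R, start=0, gamma=0.9):
    """Plan with inferred_R, then score that plan under true_R."""
    planned = greedy(q_values(T, inferred_R, gamma))
    optimal = greedy(q_values(T, true_R, gamma))
    achieved = policy_value(T, true_R, planned, gamma)[start]
    best = policy_value(T, true_R, optimal, gamma)[start]  # assumed > 0
    return 100.0 * achieved / best

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
T = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(0, 1.0)], [(2, 1.0)]],
    [[(1, 1.0)], [(2, 1.0)]],
]
true_R = [0.0, 0.0, 1.0]
good = percent_of_max_reward(T, true_R, inferred_R=[0.0, 0.0, 0.5])
bad = percent_of_max_reward(T, true_R, inferred_R=[1.0, 0.0, 0.0])
print(good, bad)
```

A rescaled but correctly ordered reward induces the same plan as the true reward, while a reward that points the wrong way produces a plan that collects nothing from the chosen start state.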

Figures

Figures reproduced from arXiv: 1906.09624 by Anca D. Dragan, Noah Gundotra, Pieter Abbeel, Rohin Shah.

Figure 1: While we could correct for systematic biases by having our AI system reason about explicit models of human reasoning, using the wrong assumption can lead to agents that do not correctly understand what people want. A natural alternative is to learn human biases from data. Our goal in this work is to investigate this alternative and gain insight into what additional assumptions might make it feasible and wh…
Figure 2: The plans of our synthetic agents on two navigation environments. Actual trajectories could differ due to randomness in the transitions. Green squares indicate positive reward while red squares indicate negative reward, with darker colors indicating higher magnitude of reward. sophisticated hyperbolic time discounters can be "tempted" by a proximate smaller reward. The naive agent fails to anticipate the …
Figure 3: The architecture and operations on it used in the algorithms. training jointly as in Equation TRAIN-JOINTLY. Since the reward inference requires a differentiable planner, we need a method that sets a differentiable planner to be optimal (that is, the planner that maximizes expected reward). This can be done by simulating data from an optimal agent with randomly generated world models and rewards, and use …
Figure 4: Reward obtained when planning with the inferred reward, as a percentage of the maximum possible reward, for different bias models and algorithms. We implement five types of biases by modifying the value iteration algorithm to produce a different set of Q-values. The top row shows results for agents that choose between the best actions from these Q-values, while in the bottom row the agent chooses actions w…
Figure 5: Percent reward obtained for different bias models using variations of Algorithm 2, which does not get access to any known rewards. These algorithms can vary along two dimensions – whether they are initialized with the assumption that the demonstrator is rational, and whether they train the planner and reward jointly or with coordinate ascent. The original version of Algorithm 2 does initialize, and trains …
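
The Figure 4 caption says biases are implemented by modifying value iteration to produce a different set of Q-values. The captions do not give the exact modifications, so the following is only a guess at the flavor: a myopic agent modeled by truncating value iteration to a short horizon. The chain environment, the horizon values, and the function names are hypothetical.

```python
def chain_q_values(rewards, horizon, gamma=0.9):
    """Value iteration on a 1-D chain, stopped after `horizon` backups.

    Action 0 moves left and action 1 moves right; moves are clipped at the
    ends of the chain. A small horizon makes the agent ignore distant rewards.
    """
    n = len(rewards)
    V = [0.0] * n
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(horizon):
        Q = [[rewards[s] + gamma * V[max(s - 1, 0)],
              rewards[s] + gamma * V[min(s + 1, n - 1)]] for s in range(n)]
        V = [max(q) for q in Q]
    return Q

def greedy_action(q):
    """Index of the highest Q-value (first index wins ties)."""
    return max(range(len(q)), key=lambda a: q[a])

# Small reward one step to the left of state 1, large reward three steps right.
rewards = [1.0, 0.0, 0.0, 0.0, 10.0]

myopic_q = chain_q_values(rewards, horizon=2)        # short-sighted planner
farsighted_q = chain_q_values(rewards, horizon=200)  # effectively converged

myopic_action = greedy_action(myopic_q[1])
farsighted_action = greedy_action(farsighted_q[1])
print(myopic_action, farsighted_action)
```

From state 1 the short-horizon planner turns toward the nearby small reward, while the converged planner heads for the distant large one. A reward inference method that assumes the converged planner would misread the short-horizon demonstrations.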
Original abstract

Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning. But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach. We decided to put this to the test -- rather than relying on assumptions about which specific bias the demonstrator has when planning, we instead learn the demonstrator's planning algorithm that they use to generate demonstrations, as a differentiable planner. Our exploration yielded mixed findings: on the one hand, learning the planner can lead to better reward inference than relying on the wrong assumption; on the other hand, this benefit is dwarfed by the loss we incur by going from an exact to a differentiable planner. This suggests that at least for the foreseeable future, agents need a middle ground between the flexibility of data-driven methods and the useful bias of known human biases. Code is available at https://tinyurl.com/learningbiases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper investigates the feasibility of learning a differentiable planner from demonstrations as an alternative to explicitly assuming specific human biases (such as risk aversion or myopia) when performing reward inference via inverse reinforcement learning. Through empirical comparisons in multiple environments, it reports that learning the planner can outperform reward inference based on an incorrect bias assumption, but that the performance degradation from replacing an exact planner with a differentiable approximation is substantially larger than this benefit, leading to the conclusion that a middle ground between data-driven flexibility and known bias models is needed. Code is released.

Significance. If the reported comparisons hold under the chosen architectures and environments, the work supplies concrete empirical evidence on the relative costs of model mismatch versus approximation error in IRL, supporting the value of hybrid approaches. The release of code is a positive factor for reproducibility and further investigation of the trade-off.

minor comments (2)
  1. §4 (Experiments): the quantitative results comparing learned planner vs. mismatched bias would benefit from explicit reporting of statistical significance or error bars across runs to strengthen the mixed-finding claim.
  2. The environments and planner architectures are described, but a table summarizing all hyperparameter choices and baseline implementations would improve clarity for replication.
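
For the first comment, a small sketch of how per-condition error bars could be reported across random seeds. The numbers are placeholders for illustration only and are not results from the paper; the function name is an assumption.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (needs at least 2 values)."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Placeholder percent-of-max-reward scores, one per seed (not real results).
percent_reward_by_seed = [80.0, 84.0, 82.0, 86.0, 78.0]

mean, sem = mean_and_standard_error(percent_reward_by_seed)
print(f"{mean:.1f} ± {sem:.1f} (n={len(percent_reward_by_seed)})")
```

Reporting the seed count alongside the interval lets readers judge whether the gap between two conditions is larger than run-to-run variation.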

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly captures the paper's core empirical finding that learning a differentiable planner can improve reward inference relative to an incorrect explicit bias assumption, yet the performance cost of replacing an exact planner with a differentiable approximation is substantially larger.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is an empirical comparison of reward inference approaches (learning a differentiable planner from data versus assuming specific human biases like risk-aversion or myopia). No load-bearing derivations, predictions, or uniqueness theorems are present that reduce to fitted inputs or self-citations by construction. The mixed findings are directly supported by experiments and released code, with the central claim being internally consistent with the stated goal of testing a data-driven alternative. This is the most common honest finding for purely empirical work without theoretical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract. The approach relies on standard IRL assumptions and differentiable optimization from prior literature.

pith-pipeline@v0.9.0 · 5789 in / 996 out tokens · 37888 ms · 2026-05-25T17:32:01.744098+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mitigating Cognitive Bias in RLHF by Altering Rationality

    cs.AI 2026-05 unverdicted novelty 6.0

    Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
