On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference
Pith reviewed 2026-05-25 17:32 UTC · model grok-4.3
The pith
Learning a demonstrator's planning process improves reward inference over wrong bias assumptions, but the switch from exact to differentiable planning hurts more than the gain helps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rather than assuming the expert is noisily optimal or has a known bias such as risk-aversion, the method learns the demonstrator's planning algorithm itself as a differentiable planner from the provided demonstrations. Experiments demonstrate that this learned planner produces better reward inference than an incorrect fixed bias assumption, but the performance loss incurred by replacing an exact planner with the learned differentiable version is substantially larger than the benefit obtained from avoiding the wrong assumption.
What carries the argument
A differentiable planner trained on demonstrations to stand in for the demonstrator's actual planning process inside the reward inference objective.
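To make "differentiable planner" concrete, here is a minimal sketch that contrasts an exact value-iteration backup with a smooth one. It is an illustration only and not the paper's architecture. The paper's planner is learned from demonstrations, while this sketch merely swaps the hard max for a log-sum-exp. The chain environment, the `chain_values` function, and the `temperature` parameter are all assumptions introduced here.

```python
import math


def chain_values(rewards, gamma=0.9, iters=50, temperature=None):
    """Value iteration on a deterministic chain with actions left/right.

    The reward is collected on arrival in the next state.
    temperature=None uses the exact hard-max backup; a positive temperature
    swaps in a smooth log-sum-exp backup that gradients can pass through.
    """
    n = len(rewards)
    values = [0.0] * n
    for _ in range(iters):
        updated = []
        for s in range(n):
            successors = (max(s - 1, 0), min(s + 1, n - 1))
            q = [rewards[nxt] + gamma * values[nxt] for nxt in successors]
            top = max(q)
            if temperature is None:
                updated.append(top)
            else:
                # Numerically stable log-sum-exp; approaches max(q) as temperature -> 0.
                soft = top + temperature * math.log(
                    sum(math.exp((x - top) / temperature) for x in q)
                )
                updated.append(soft)
        values = updated
    return values


rewards = [0.0, 0.0, 0.0, 1.0]  # hypothetical toy reward, not from the paper
exact = chain_values(rewards)
smooth = chain_values(rewards, temperature=0.1)
gap = max(abs(a - b) for a, b in zip(exact, smooth))
```

In this toy setting the smooth values never fall below the exact ones, and with two actions the difference is bounded by `temperature * log(2) / (1 - gamma)`. The approximation loss reported for the learned planner has a different source (imperfect training), which this sketch does not model.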
If this is right
- Reward inference accuracy rises when the learned planner matches the demonstrator's actual process better than a mismatched explicit bias does.
- An exact planner paired with the correct bias assumption produces the lowest inference error.
- The approximation gap between exact and differentiable planners exceeds the penalty from using an incorrect bias.
- Reward inference systems benefit from retaining some explicit structure rather than moving to fully learned planners.
Where Pith is reading between the lines
- Hybrid planners that embed limited learnable parameters inside an otherwise exact planning algorithm could reduce approximation error while still adapting to observed behavior.
- Evaluating the same tradeoff on real human demonstrations rather than simulated biased planners would test whether the reported gap persists outside controlled settings.
- Improving the training stability or capacity of differentiable planners might narrow the performance difference to exact planners enough to make bias learning advantageous.
Load-bearing premise
The error introduced by training and using a differentiable planner remains small enough that it does not overwhelm any improvement gained by avoiding a mismatched bias assumption.
What would settle it
An experiment in which reward inference error with the learned differentiable planner falls below the error obtained with an exact planner that uses the correct bias assumption would falsify the dominance of the approximation loss.
Original abstract
Our goal is for agents to optimize the right reward function, despite how difficult it is for us to specify what that is. Inverse Reinforcement Learning (IRL) enables us to infer reward functions from demonstrations, but it usually assumes that the expert is noisily optimal. Real people, on the other hand, often have systematic biases: risk-aversion, myopia, etc. One option is to try to characterize these biases and account for them explicitly during learning. But in the era of deep learning, a natural suggestion researchers make is to avoid mathematical models of human behavior that are fraught with specific assumptions, and instead use a purely data-driven approach. We decided to put this to the test -- rather than relying on assumptions about which specific bias the demonstrator has when planning, we instead learn the demonstrator's planning algorithm that they use to generate demonstrations, as a differentiable planner. Our exploration yielded mixed findings: on the one hand, learning the planner can lead to better reward inference than relying on the wrong assumption; on the other hand, this benefit is dwarfed by the loss we incur by going from an exact to a differentiable planner. This suggests that at least for the foreseeable future, agents need a middle ground between the flexibility of data-driven methods and the useful bias of known human biases. Code is available at https://tinyurl.com/learningbiases.
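The difference between a noisily optimal demonstrator and a systematically biased one can be sketched in a few lines. The example below is a hypothetical illustration, not the paper's simulation code: it models myopia as a short planning horizon and noisy optimality as a Boltzmann (softmax) policy over Q-values. The chain environment, the function names, and the parameters `horizon` and `beta` are all assumptions made here.

```python
import math


def chain_q_values(rewards, horizon, gamma=0.95):
    """Q-values on a deterministic chain after `horizon` backups.

    Actions are (left, right); the reward is collected on arrival.
    A small horizon models a myopic demonstrator.
    """
    n = len(rewards)
    values = [0.0] * n
    q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(horizon):
        q = [
            [rewards[nxt] + gamma * values[nxt]
             for nxt in (max(s - 1, 0), min(s + 1, n - 1))]
            for s in range(n)
        ]
        values = [max(row) for row in q]
    return q


def boltzmann_policy(q_row, beta):
    """Noisily optimal action distribution; larger beta means less noise."""
    top = max(q_row)
    weights = [math.exp(beta * (x - top)) for x in q_row]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy task: a small reward nearby on the left, a large one far right.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 10.0]
start = 1
myopic_q = chain_q_values(rewards, horizon=1)[start]
farsighted_q = chain_q_values(rewards, horizon=20)[start]
myopic_probs = boltzmann_policy(myopic_q, beta=1.0)
```

Here the myopic planner prefers the nearby small reward while the far-sighted planner heads for the distant large one. An inference procedure that assumes far-sighted, noisily optimal behavior would misread the myopic demonstrations as evidence that the left state is more valuable than it is.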
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the feasibility of learning a differentiable planner from demonstrations as an alternative to explicitly assuming specific human biases (such as risk aversion or myopia) when performing reward inference via inverse reinforcement learning. Through empirical comparisons in multiple environments, it reports that learning the planner can outperform reward inference based on an incorrect bias assumption, but that the performance degradation from replacing an exact planner with a differentiable approximation is substantially larger than this benefit, leading to the conclusion that a middle ground between data-driven flexibility and known bias models is needed. Code is released.
Significance. If the reported comparisons hold under the chosen architectures and environments, the work supplies concrete empirical evidence on the relative costs of model mismatch versus approximation error in IRL, supporting the value of hybrid approaches. The release of code is a positive factor for reproducibility and further investigation of the trade-off.
minor comments (2)
- §4 (Experiments): the quantitative results comparing learned planner vs. mismatched bias would benefit from explicit reporting of statistical significance or error bars across runs to strengthen the mixed-finding claim.
- The environments and planner architectures are described, but a table summarizing all hyperparameter choices and baseline implementations would improve clarity for replication.
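The error-bar suggestion in the first comment could be met with something as simple as the sketch below. It assumes each condition has been run with several random seeds and that a scalar reward-inference error is recorded per run. The helper names are hypothetical and no numbers from the paper are used.

```python
import math
import statistics


def mean_and_stderr(errors):
    """Mean and standard error of per-run inference errors."""
    if len(errors) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(errors)
    stderr = statistics.stdev(errors) / math.sqrt(len(errors))
    return mean, stderr


def intervals_overlap(a, b, z=1.96):
    """Check whether two (mean, stderr) summaries have overlapping z-intervals."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    low_a, high_a = mean_a - z * se_a, mean_a + z * se_a
    low_b, high_b = mean_b - z * se_b, mean_b + z * se_b
    return low_a <= high_b and low_b <= high_a


# Placeholder values for illustration only, not results from the paper.
summary = mean_and_stderr([1.0, 2.0, 3.0])
```

Non-overlapping intervals are only a rough indicator. A paired test across matched seeds would be a stronger way to support the claim that one condition beats another.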
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly captures the paper's core empirical finding that learning a differentiable planner can improve reward inference relative to an incorrect explicit bias assumption, yet the performance cost of replacing an exact planner with a differentiable approximation is substantially larger.
Circularity Check
No significant circularity identified
full rationale
The paper is an empirical comparison of reward inference approaches (learning a differentiable planner from data versus assuming specific human biases like risk-aversion or myopia). No load-bearing derivations, predictions, or uniqueness theorems are present that reduce to fitted inputs or self-citations by construction. The mixed findings are directly supported by experiments and released code, with the central claim being internally consistent with the stated goal of testing a data-driven alternative. This is the most common honest finding for purely empirical work without theoretical reductions.
Forward citations
Cited by 1 Pith paper
- Mitigating Cognitive Bias in RLHF by Altering Rationality. Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.