
arxiv: 2605.30903 · v1 · pith:V6YJLZJ6new · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Pith reviewed 2026-06-28 23:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: inverse reinforcement learning · feasible reward set · suboptimal demonstrators · linear constraints · reward recovery · function approximation · LLM fine-tuning

The pith

Intersecting linear constraints from multiple imperfect demonstrators recovers the feasible reward set of the ground-truth optimal demonstrator under coverage conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies inverse reinforcement learning when demonstrations come from several imperfect agents whose suboptimality levels are known in advance. It encodes each agent's declared suboptimality as a linear inequality on the reward function and forms the joint feasible reward set by intersecting all such inequalities. The analysis proves that this joint set shrinks monotonically with each added demonstrator and supplies an exact rule for when the addition strictly reduces the set. Two recovery guarantees are given for the feasible set belonging to the unknown optimal demonstrator: one that uses closeness to the optimal occupancy measure and a second that needs only sufficient state-action coverage. The framework is implemented with an offline algorithm that handles function approximation and is tested in grid worlds and LLM fine-tuning tasks.

Core claim

By treating each demonstrator's suboptimality level as a linear constraint and intersecting the resulting half-spaces, the feasible reward set for the ground-truth optimal demonstrator can be recovered even when no demonstrator is near-optimal, provided the collected trajectories supply sufficient coverage; the joint set contracts monotonically and a new demonstrator tightens it precisely when its constraint cuts the current polytope.

What carries the argument

The feasible reward set formed by intersecting linear suboptimality constraints, one per demonstrator.
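
A minimal sketch of this construction, under assumptions that are ours rather than the paper's: a toy problem with one state and three actions, so that occupancy measures are probability vectors over actions and the candidate optimal occupancies are the three one-hot vectors. The names `demonstrator_constraints`, `is_feasible`, and all numbers are hypothetical. Each demonstrator contributes one half-space per candidate occupancy, and the joint feasible set is the intersection of all of them.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
HalfSpace = Tuple[List[float], float]  # (a, b) encodes the constraint a . R <= b


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def demonstrator_constraints(mu_d: Vector, eps: float,
                             candidates: Sequence[Vector]) -> List[HalfSpace]:
    """One half-space per candidate occupancy mu: R . (mu - mu_d) <= eps."""
    return [([m - d for m, d in zip(mu, mu_d)], eps) for mu in candidates]


def is_feasible(reward: Vector, constraints: Sequence[HalfSpace],
                tol: float = 1e-9) -> bool:
    """A reward is in the joint set if it satisfies every half-space."""
    return all(dot(a, reward) <= b + tol for a, b in constraints)


# Hypothetical toy problem: one state, three actions.
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# (occupancy measure, declared suboptimality) for two imperfect demonstrators.
demos = [
    ([0.6, 0.3, 0.1], 0.5),
    ([0.2, 0.7, 0.1], 0.8),
]

# Joint feasible set = intersection of every demonstrator's half-spaces.
joint = [c for mu_d, eps in demos
         for c in demonstrator_constraints(mu_d, eps, vertices)]

true_reward = [1.0, 0.2, 0.0]  # consistent with both declared levels
bad_reward = [0.0, 0.0, 1.0]   # would make demonstrator 1 worse than declared

print(is_feasible(true_reward, joint))
print(is_feasible(bad_reward, joint))
```

In this toy case the first reward satisfies all six inequalities, while the second violates the first demonstrator's bound, since that demonstrator would then be 0.9 below optimal against a declared level of 0.5.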

If this is right

  • The joint feasible set shrinks monotonically as additional demonstrators are incorporated.
  • A new demonstrator strictly tightens the set if and only if its linear constraint cuts the interior of the current polytope.
  • Recovery of the ground-truth optimal demonstrator's feasible reward set is guaranteed with a bound that depends on closeness to the optimal occupancy measure.
  • A second recovery guarantee holds under sufficient coverage alone, without any near-optimal demonstrator.
  • The offline algorithm with function approximation produces a practical reward set usable for downstream policy optimization.
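
The first two consequences above can be checked numerically in a small case. The sketch below is an illustration under our own assumptions, not the paper's algorithm: two actions, rewards restricted to a grid on the box [-1, 1]^2, and a grid-based stand-in for the "cuts the polytope" test (a demonstrator counts as tightening the set if at least one surviving grid point violates its constraints). All demonstrator values are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def constraints_for(mu_d, eps, vertices):
    # One half-space per candidate occupancy v: R . (v - mu_d) <= eps.
    return [([m - d for m, d in zip(v, mu_d)], eps) for v in vertices]


def feasible(r, cons, tol=1e-9):
    return all(dot(a, r) <= b + tol for a, b in cons)


vertices = [[1.0, 0.0], [0.0, 1.0]]
# Candidate rewards: a 9 x 9 grid with spacing 0.25 on [-1, 1]^2.
grid = [(i / 4, j / 4) for i, j in product(range(-4, 5), repeat=2)]

# Hypothetical demonstrators, added one at a time.
demos = [
    ([0.5, 0.5], 0.5),  # cuts the box
    ([0.5, 0.5], 1.0),  # looser copy of the first: redundant
    ([0.8, 0.2], 0.1),  # cuts the remaining set
]

current = list(grid)
counts, tightened = [], []
for mu_d, eps in demos:
    cons = constraints_for(mu_d, eps, vertices)
    survivors = [r for r in current if feasible(r, cons)]
    tightened.append(len(survivors) < len(current))  # strict shrink?
    counts.append(len(survivors))
    current = survivors  # the set can only lose points, never gain them

print(counts)
print(tightened)
```

The surviving count never increases, and it stays flat exactly for the redundant second demonstrator. An exact version of the tightening test would replace the grid with a linear program that maximizes the new constraint's left-hand side over the current polytope.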

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intersection idea could be applied to preference data where each user supplies an implicit suboptimality bound.
  • In high-dimensional control, the coverage condition might be checked via reachability analysis before claiming recovery.
  • Ambiguity-resolution heuristics proposed for the reward set could be replaced by downstream task performance as a selection criterion.

Load-bearing premise

Each demonstrator's declared suboptimality level can be accurately encoded as a linear constraint on the reward.
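
One standard way to see why this premise yields linear constraints is sketched below. The setup is our assumption, since only the abstract is visible here: a discounted MDP with discount $\gamma$, normalized occupancy measures, the reward $R$ treated as a vector over state-action pairs, $\mathcal{M}$ the set of achievable occupancy measures, and $\mu_d$, $\varepsilon_d$ the occupancy measure and declared suboptimality of demonstrator $d$.

```latex
% Assumed setup: J_R is the normalized discounted return.
\begin{align*}
\mu^{\pi}(s,a) &= (1-\gamma)\,\mathbb{E}_{\pi}\Big[\sum_{t\ge 0}\gamma^{t}\,
                  \mathbf{1}\{s_t=s,\ a_t=a\}\Big], \\
J_R(\pi) &= \sum_{s,a}\mu^{\pi}(s,a)\,R(s,a) \;=\; \langle R,\ \mu^{\pi}\rangle, \\
\max_{\pi} J_R(\pi) - J_R(\pi_d) \le \varepsilon_d
  &\iff \langle R,\ \mu - \mu_d\rangle \le \varepsilon_d
  \quad \text{for all } \mu \in \mathcal{M}.
\end{align*}
```

Each inequality on the right is linear in $R$ for a fixed $\mu$. Because a linear function over a polytope attains its maximum at a vertex, it is enough to impose the inequality at the vertices of $\mathcal{M}$. The premise is fragile exactly where this sketch is: if $\mu_d$ or $\varepsilon_d$ is misreported, the half-spaces are misplaced.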

What would settle it

An experiment in which adding a new demonstrator with a known suboptimality level fails to shrink the feasible set, or in which the recovered set excludes the true reward despite full coverage of the state-action space.

Figures

Figures reproduced from arXiv: 2605.30903 by Jiawei Zhang, Kihyun Kim, Nikos Vlassis, Shripad Deshmukh.

Figure 1: 2D illustration of (a) the feasible reward set for a single optimal demonstrator (black cone), ...
Figure 2: Suboptimality of the recovered reward for the proposed method and the MaxEnt baseline.
Figure 3: LLM experiment on arithmetic tool-use tasks. Left: Example of prompt and the optimal ...
Figure 4: Grid-world examples with detour portals for the two cases of suboptimal demonstrators.

Original abstract

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a feasible-reward-set framework for inverse reinforcement learning from multiple demonstrators with heterogeneous but declared suboptimality levels. For each demonstrator the declared suboptimality ε is encoded as a linear constraint on the reward vector (via the difference between the demonstrator's occupancy measure and the optimal one); the joint feasible set is the intersection of these constraints. The manuscript proves that this joint set shrinks monotonically with additional demonstrators, supplies an exact characterization of when a new demonstrator strictly tightens the set, and states two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator (one depending on closeness to the optimal occupancy, the other requiring only sufficient coverage). Practical contributions include ambiguity-resolution strategies and an offline algorithm with function approximation; experiments in tabular grid-worlds and LLM fine-tuning are reported to be consistent with the theory.

Significance. If the linear-constraint encoding is valid and the occupancy measures are known exactly, the monotonicity result and the two recovery guarantees would constitute a useful theoretical advance for IRL settings that do not assume optimality. The provision of a function-approximation algorithm and the LLM experiment also add practical value for high-dimensional applications.

major comments (2)
  1. [Abstract / §3 (linear constraint definition)] Abstract / linear-constraint construction: the claim that each demonstrator's declared suboptimality level can be encoded exactly as a linear inequality on the reward (via expected-return difference under occupancy measures) is load-bearing for monotonic shrinking, the tightening characterization, and both recovery guarantees. The manuscript must derive this inequality from first principles and state the precise assumptions (exact knowledge of occupancy measures, absence of variance or non-stationarity effects) under which the inequality remains valid and tight; without this derivation the intersection may exclude the ground-truth reward or retain spurious rewards.
  2. [Abstract / recovery guarantees section] Recovery guarantees (abstract): the two stated bounds presuppose that the linear constraints are both valid and tight for the declared ε values. If occupancy estimation error or higher-order effects are present, the feasible set may fail to contain the ground-truth reward; the manuscript should supply a formal statement of the bounds together with the error terms that arise from constraint violation.
minor comments (1)
  1. [Abstract] The abstract refers to 'strategies to address the inherent reward ambiguity' without naming them; a one-sentence description would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the linear-constraint derivation and the recovery guarantees.

  1. Referee: [Abstract / §3 (linear constraint definition)] Abstract / linear-constraint construction: the claim that each demonstrator's declared suboptimality level can be encoded exactly as a linear inequality on the reward (via expected-return difference under occupancy measures) is load-bearing for monotonic shrinking, the tightening characterization, and both recovery guarantees. The manuscript must derive this inequality from first principles and state the precise assumptions (exact knowledge of occupancy measures, absence of variance or non-stationarity effects) under which the inequality remains valid and tight; without this derivation the intersection may exclude the ground-truth reward or retain spurious rewards.

    Authors: We agree that an explicit derivation from first principles is necessary for clarity. The constraint arises directly because the expected return equals the inner product of the reward vector R and the occupancy measure μ; a declared suboptimality ε therefore yields the linear inequality R · (μ* − μ_d) ≤ ε. We will add a dedicated paragraph in §3 that starts from the definition of return, states the exact assumptions (known occupancy measures, stationary policies, and deterministic transitions), and shows why the inequality is valid and tight under those conditions. This revision will also note that the monotonicity and recovery results hold precisely when these assumptions are satisfied. revision: yes

  2. Referee: [Abstract / recovery guarantees section] Recovery guarantees (abstract): the two stated bounds presuppose that the linear constraints are both valid and tight for the declared ε values. If occupancy estimation error or higher-order effects are present, the feasible set may fail to contain the ground-truth reward; the manuscript should supply a formal statement of the bounds together with the error terms that arise from constraint violation.

    Authors: We acknowledge the need to make the dependence on constraint validity explicit. We will augment the recovery-guarantee theorems with a corollary that introduces an additive slack term δ (the maximum violation of any linear constraint) and shows that the ground-truth reward lies inside an enlarged feasible set whose radius grows linearly with δ. The statement will also indicate how the monotonic-shrinkage property is preserved under approximate constraints. These additions will appear in the theoretical section immediately following the original guarantees. revision: yes
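
To make the role of a slack term concrete, here is one way such a $\delta$ could arise. This is our illustration, not the authors' promised corollary. Assume rewards are bounded, $\|R\|_\infty \le 1$, and the demonstrator's occupancy measure is only known through an estimate $\hat{\mu}_d$ with $\|\hat{\mu}_d - \mu_d\|_1 \le \eta$.

```latex
% Hoelder's inequality bounds the effect of the estimation error on any return.
\begin{align*}
\big|\langle R,\ \hat{\mu}_d - \mu_d\rangle\big|
  &\le \|R\|_\infty\,\|\hat{\mu}_d - \mu_d\|_1 \;\le\; \eta, \\
\langle R,\ \mu - \mu_d\rangle \le \varepsilon_d
  &\implies
  \langle R,\ \mu - \hat{\mu}_d\rangle
  = \langle R,\ \mu - \mu_d\rangle + \langle R,\ \mu_d - \hat{\mu}_d\rangle
  \le \varepsilon_d + \eta .
\end{align*}
```

Under these assumptions, every reward that satisfies the exact constraint also satisfies the estimated constraint relaxed by $\delta = \eta$, so relaxing by the estimation error keeps the ground-truth reward inside the set at the cost of a larger set.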

Circularity Check

0 steps flagged

No circularity; framework derives results from explicit modeling assumptions

full rationale

The paper constructs the feasible reward set by directly encoding each demonstrator's declared suboptimality ε as a linear inequality on the reward vector (via occupancy measure differences) and intersecting the resulting half-spaces. All claimed properties—monotonic shrinking of the joint set, exact characterization of when a new demonstrator tightens the intersection, and the two recovery guarantees—are standard consequences of this linear feasible-set construction under the modeling assumptions. No equation or guarantee is shown to equal its own inputs by construction, no fitted parameter is relabeled as a prediction, and no self-citation or prior ansatz is invoked as load-bearing justification. The derivation is therefore self-contained conditional on the stated linear encoding.
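
The monotonicity claim in this rationale follows from set intersection alone, as the short sketch below shows. The notation is ours: $H_i$ is the set of rewards consistent with demonstrator $i$'s declared level $\varepsilon_i$, $\mathcal{M}$ is the set of achievable occupancy measures, and $F_n$ is the joint feasible set after $n$ demonstrators. The paper's exact tightening characterization may be stated in a sharper form than the definitional one given here.

```latex
\begin{align*}
H_i &= \{\, R : \langle R,\ \mu - \mu_i\rangle \le \varepsilon_i
        \ \text{ for all } \mu \in \mathcal{M} \,\},
\qquad F_n = \bigcap_{i=1}^{n} H_i, \\
F_{n+1} &= F_n \cap H_{n+1} \;\subseteq\; F_n, \\
F_{n+1} \subsetneq F_n
  &\iff \exists\, R \in F_n,\ \mu \in \mathcal{M}:\
        \langle R,\ \mu - \mu_{n+1}\rangle > \varepsilon_{n+1}.
\end{align*}
```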

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; limited visibility into full set of assumptions.

axioms (1)
  • Domain assumption: declared suboptimality levels of demonstrators can be encoded as linear constraints on the reward function.
    Stated directly in the abstract as the basis for constructing feasible sets.

pith-pipeline@v0.9.1-grok · 5732 in / 1150 out tokens · 20871 ms · 2026-06-28T23:56:22.877335+00:00 · methodology

