
arxiv: 2605.21822 · v1 · pith:652VCRXKnew · submitted 2026-05-20 · 💻 cs.AI

Implicit Safety Alignment from Crowd Preferences

Pith reviewed 2026-05-22 08:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords safety alignment · crowd preferences · RLHF · hierarchical reinforcement learning · implicit safety · safe RL · preference-based learning

The pith

Shared safety principles can be extracted from crowd preferences and reused as composable skills to enforce safety in new RL tasks without explicit safety rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that crowd preference data collected for reinforcement learning from human feedback embed common safety criteria even when users pursue different goals. These shared principles can be extracted as reusable skills rather than folded directly into task rewards. A hierarchical setup then lets a high-level policy combine those skills to solve downstream tasks safely. If the claim holds, agents can meet safety standards on new problems using only existing preference data, with no new safety annotations or safety rewards.

Core claim

A hierarchical framework extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to solve downstream RL tasks, lowering safety costs while matching the task performance of oracle methods that have direct access to ground-truth safety signals.

What carries the argument

The hierarchical Safe Crowd Preference-based RL framework that extracts safety-aligned skills from crowd preferences and composes them with a high-level policy.
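The composition step can be sketched in a few lines. The example below is a toy under stated assumptions, not the paper's implementation: the three skills are hand-written stand-ins for skills learned from crowd preferences, the environment is a hypothetical 1-D line, "safe" is assumed to mean a step of at most one cell, and the high-level policy is plain tabular Q-learning trained on the task reward alone.

```python
import random

# Hypothetical frozen low-level skills on a 1-D integer line. In the paper the
# skills are learned from crowd preferences; these are hand-written stand-ins,
# and "safe" is assumed to mean a step of at most one cell.
SKILLS = {
    "step_left": lambda s: -1,
    "hold": lambda s: 0,
    "step_right": lambda s: 1,
}
SKILL_NAMES = list(SKILLS)
LOW, HIGH, GOAL = -5, 5, 3


def env_step(state, action):
    """Move, clamp to the line, and return (next_state, task_reward). No safety term."""
    nxt = max(LOW, min(HIGH, state + action))
    return nxt, -abs(nxt - GOAL)


def train_high_level(episodes=500, horizon=20, eps=0.2, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning over (state, skill index) using the task reward only."""
    rng = random.Random(seed)
    n = len(SKILL_NAMES)
    q = {(s, k): 0.0 for s in range(LOW, HIGH + 1) for k in range(n)}
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if rng.random() < eps:
                k = rng.randrange(n)  # explore: random skill
            else:
                k = max(range(n), key=lambda i: q[(s, i)])  # exploit: best skill
            a = SKILLS[SKILL_NAMES[k]](s)  # the frozen skill picks the action
            nxt, r = env_step(s, a)
            best_next = max(q[(nxt, i)] for i in range(n))
            q[(s, k)] += alpha * (r + gamma * best_next - q[(s, k)])
            s = nxt
    return q


def greedy_rollout(q, start=0, horizon=20):
    """Run the greedy high-level policy; return visited states and executed actions."""
    n = len(SKILL_NAMES)
    s, states, actions = start, [start], []
    for _ in range(horizon):
        k = max(range(n), key=lambda i: q[(s, i)])
        a = SKILLS[SKILL_NAMES[k]](s)
        s, _ = env_step(s, a)
        states.append(s)
        actions.append(a)
    return states, actions


q_table = train_high_level()
states, actions = greedy_rollout(q_table)
print("final state:", states[-1], "actions used:", sorted(set(actions)))
```

Because the high-level policy can only select among the frozen skills, every executed action stays inside the skills' action set regardless of what the task reward encourages. That restriction is the property a hierarchy of this kind relies on.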

If this is right

  • Directly combining a preference-learned reward model with task rewards has inherent limitations for safety.
  • Extracting and composing safety skills from crowd data achieves safety performance close to methods given explicit safety rewards.
  • The approach works across standard safe RL benchmarks and a preliminary LLM-style task with varied user goals.
  • Safety regularization becomes possible without access to separate safety reward functions.
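The limitation of direct reward combination named in the first bullet can be illustrated with a toy calculation. The sketch assumes the combined objective is a convex mix `(1 - omega) * task + omega * pref`; this page only indicates that the RC baseline has a weight ω, so the exact form is an assumption, and the three candidate trajectories and their numbers are made up for illustration.

```python
def combine(task_return, pref_return, omega):
    """Assumed form of direct reward combination: a convex mix of the two signals."""
    return (1.0 - omega) * task_return + omega * pref_return


# name: (task_return, learned_preference_return, true_safety_cost)
# Made-up toy values; the true safety cost is never seen by the optimizer.
CANDIDATES = {
    "fast_unsafe": (1.0, 0.2, 1.0),
    "balanced_safe": (0.8, 0.7, 0.0),
    "timid_safe": (0.3, 0.9, 0.0),
}


def best_candidate(omega):
    """Return the candidate that maximizes the combined objective for this weight."""
    return max(
        CANDIDATES,
        key=lambda name: combine(CANDIDATES[name][0], CANDIDATES[name][1], omega),
    )


for omega in (0.1, 0.5, 0.9):
    choice = best_candidate(omega)
    task, _, cost = CANDIDATES[choice]
    print(f"omega={omega}: {choice} (task={task}, cost={cost})")
```

In this toy, a small weight selects the unsafe trajectory and a large weight gives up most of the task return, so the outcome hinges on tuning a single scalar without ever observing the safety cost.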

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the extracted skills prove stable, they could be stored in a library and reused across many downstream tasks rather than re-learned each time.
  • The same extraction step might surface other implicit constraints, such as fairness or resource limits, from the same preference datasets.
  • Testing whether the learned skills remain effective when user goals shift more dramatically than in the reported experiments would clarify the limits of transfer.

Load-bearing premise

Users with different goals still follow similar underlying safety principles that can be isolated as reusable skills and transferred across tasks.
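One way to make this premise concrete, offered as a sketch and not as the paper's own formulation, is to assume that a user's utility for a trajectory τ under task z_k splits into a task-specific return and a shared safety cost. The symbols r_{z_k}, c, λ, and the logistic link σ below are hypothetical, and the additive form is only one possible modeling choice.

```latex
u(\tau, z_k) = r_{z_k}(\tau) - \lambda\, c(\tau), \qquad \lambda > 0,
\qquad
P(\tau_a \succ \tau_b \mid z_k) = \sigma\bigl(u(\tau_a, z_k) - u(\tau_b, z_k)\bigr)
```

Under this assumption the cost c does not depend on k, which is what would allow a safety signal to be separated from the task-specific terms and reused on a new task.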

What would settle it

A controlled experiment on a new safe RL environment in which the hierarchical method produces higher cumulative safety violations than a direct reward-combination baseline or a non-hierarchical preference model.
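Such a comparison could be scored with a bootstrap confidence interval on the gap in mean cumulative safety cost. The sketch below is one generic way to do that, not a procedure taken from the paper; the function name, the per-seed cost lists, and the 95% level are all assumptions, and the example numbers are placeholders.

```python
import random
import statistics


def bootstrap_cost_gap(costs_method, costs_baseline, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for mean(costs_method) - mean(costs_baseline).

    Each list holds one cumulative safety cost per independent run (for example,
    per random seed). Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        m = [rng.choice(costs_method) for _ in costs_method]
        b = [rng.choice(costs_baseline) for _ in costs_baseline]
        gaps.append(statistics.fmean(m) - statistics.fmean(b))
    gaps.sort()
    low = gaps[int((alpha / 2) * n_boot)]
    high = gaps[int((1 - alpha / 2) * n_boot) - 1]
    point = statistics.fmean(costs_method) - statistics.fmean(costs_baseline)
    return point, low, high


# Placeholder per-seed costs, invented for illustration only.
hierarchical = [4.0, 6.0, 5.0, 7.0, 5.5]
direct_combination = [9.0, 12.0, 8.5, 11.0, 10.0]
print(bootstrap_cost_gap(hierarchical, direct_combination))
```

An interval that lies entirely above zero would mean the hierarchical method incurs more violations than the baseline, which is the outcome that would count against the claim.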

Figures

Figures reproduced from arXiv: 2605.21822 by Daniel S. Brown, Qian Lin.

Figure 1. Predicted rewards on safe and unsafe samples under the imbalanced crowd-preference setting. Safe trajectories (green) receive higher rewards than unsafe ones (red), while trajectories aligned with the majority preference (blue) are over-rewarded compared to minority-preferred ones (yellow).
    … from a task z_k exceeds this threshold, the learned model's preference ordering aligns completely with u(τ, z_k). Cons…
Figure 2. Performance points of different algorithms and RC(ω) with various weights ω under balanced and imbalanced preference settings. The corresponding numerical results are reported in …
Figure 3. Performance on different downstream tasks under imbalanced settings; the x-axis corresponds to task IDs. RC either incurs high safety cost or performs well only on a subset of tasks, whereas our method remains robust across tasks.
    … noticeably away from the Oracle, consistent with the conclusion in Section 4.2 that preference imbalance induces bias in the learned reward, which in turn misguides downstream …
Figure 4. Performance under different regularization weights β_reg, noise ratios, and crowd sizes. Dashed lines denote R_norm(Oracle) in the first row and C_norm(Task-Only) in the second. Our method is robust to β_reg and crowd size; while preference noise degrades safety performance, task reward remains largely unaffected.
    Table fragment (truncated): columns Task-only, RC(ω = 0.25), RC(ω = 0.75), Ours. Reward: 0.95 ± .01, 0.94 ± .01, 0.50 ± .01, 0.75 ± .01. Cost: 0.…
Figure 5. Visualization of the six downstream environments considered in this work. In the left three panels, the red dot denotes the agent; wall constraints are present in the first and third panels, while the blue region in the second panel indicates a risky area.
    Reach. At the beginning of each episode, six hazardous regions are randomly generated. The agent is required to navigate to a target location while avoid…
Figure 6. Performance fronts of different algorithms and RC with various trade-off weights ω using balanced and imbalanced crowd preference datasets under the online downstream setting.
    Panel titles recovered from the adjacent plot: Reach, Run, Circle, Ant-vel, Swimmer-vel, HalfCheetah-v…
Figure 7. Performance curves of different algorithms during the first 0.5M training steps under online downstream settings.
    D.4. Ablations on Additional Settings. Number of preference sets N_pref and size of each preference set Sz. We further analyze the sensitivity of our method to the number of preference sets N_pref and the size of each preference set Sz. As shown in …
Figure 8. Performance of our methods under different prior regularization weights β_reg ∈ [0.0, 0.01, 0.05, 0.1, 1.0, 10.0] in Eq. 11.
    Recovered from the adjacent plot: panels Circle and Swimmer-vel; x-axis Flipping Prob; y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.
Figure 9. Performance of our methods under different preference noise ratios.
    D.5. Dense and Sparse Annotation for Crowd Preference Data. In the main experiments, we adopt a dense annotation setting, where the same N_pref trajectory sets are labeled under all M annotation tasks, resulting in N_pref × M preference sets in total. In this subsection, we additionally consider a sparse annotation setting, where for each anno…
Figure 10. Performance of our methods under different crowd sizes.
    Recovered from the adjacent plot: panels Circle and Swimmer-vel; x-axis Number of preference sets (250 to 5000); y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.
Figure 11. Performance of our methods under different numbers of preference sets N_pref.
    E. LLM Evaluation Details. Task Description. The environment consists of one query, "Please talk about one kind of pets", together with three categories of responses: bird-related, dog-related, and cat-related responses. Each category contains both safe responses and unsafe responses. Safe responses consist only of normal pet-relate…
Figure 12. Performance of our methods under different sizes of the preference set Sz.
    Recovered from the adjacent plot: panels Circle and Swimmer-vel; x-axis Training Step (1e3 to 5e5); y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.
Figure 13. Performance of our methods under different low-level policy training steps.
    … or imitating a single user cannot maximize downstream reward. Instead, successful performance requires composing crowd preferences and generalizing beyond those directly observed in the crowd data. Importantly, downstream training only provides the safety-agnostic reward r_new, without access to the shared safety objective. As a re…
Original abstract

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.
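To make "a preference-learned reward model" and "shared safety principles" concrete, the sketch below fits a linear reward to pooled pairwise preferences with a Bradley-Terry likelihood. Bradley-Terry is a common choice for preference-based reward learning, but this page does not state the paper's exact loss, so treat it as an assumption; the three features (progress toward goal A, progress toward goal B, hazard contact) and the four preference pairs are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(pairs, dim, lr=0.5, steps=200):
    """Gradient ascent on the Bradley-Terry log-likelihood of a linear reward.

    pairs: list of (features_a, features_b, label), with label = 1 when
    trajectory a is preferred over trajectory b and 0 otherwise.
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_a, phi_b, label in pairs:
            diff = [a - b for a, b in zip(phi_a, phi_b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))  # P(a preferred)
            for i in range(dim):
                grad[i] += (label - p) * diff[i]
        w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


# Features: (progress to goal A, progress to goal B, hazard contact).
# Hypothetical crowd: user 1 wants goal A, user 2 wants goal B, both avoid hazards.
PAIRS = [
    ((1, 0, 0), (0, 1, 0), 1),  # user 1: A over B
    ((1, 0, 0), (1, 0, 1), 1),  # user 1: safe A over hazardous A
    ((0, 1, 0), (1, 0, 0), 1),  # user 2: B over A
    ((0, 1, 0), (0, 1, 1), 1),  # user 2: safe B over hazardous B
]

weights = fit_bradley_terry(PAIRS, dim=3)
print("goal A, goal B, hazard weights:", weights)
```

In this balanced toy crowd the conflicting goal preferences cancel while the hazard weight is pushed negative, so the shared safety signal is the only thing the pooled model learns. With an imbalanced crowd the goal weights would no longer cancel, which is the kind of bias the paper attributes to direct reward combination.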

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Safe Crowd Preference-based RL, a hierarchical framework that extracts reusable safety-aligned skills from crowd preference data (where users share safety principles despite distinct objectives) and composes them via a high-level policy to regularize downstream RL agents. It first identifies limitations of direct reward combination between a preference-learned reward model and task rewards, then demonstrates through experiments in safe RL environments and a preliminary LLM-style task that the approach reduces safety costs without explicit safety rewards while matching oracle performance.

Significance. If the central results hold, the work provides a concrete mechanism for implicit safety alignment by isolating shared safety criteria as composable skills, addressing a practical gap in RLHF where explicit safety signals are costly or unavailable. The hierarchical separation of safety from task objectives could improve generalization and reduce entanglement compared to flat reward models.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (proposed method): The claim that extracted skills are reusable and task-independent is load-bearing for the hierarchical advantage over direct combination. The skeptic note correctly identifies that if crowd preferences entangle safety with task-specific objectives, the skill extraction step will not yield composable primitives and safety-cost reductions will not materialize. No analysis or ablation is described that tests this disentanglement (e.g., by measuring skill transfer across tasks with deliberately varied objectives).
  2. [Experiments] Experiments section: The abstract states that the method 'substantially lowers safety costs' and achieves 'task performance comparable to oracle methods,' yet the reader's report notes the absence of reported baselines, ablation results, or quantitative metrics (specific safety-cost deltas, variance, or statistical tests). Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc design choices in the skill extractor or high-level policy.
minor comments (2)
  1. [§3] Clarify the precise definition of 'safety-aligned skills' and the objective used to extract them from the preference dataset; the current description leaves open whether the extraction step implicitly uses safety labels or purely unsupervised decomposition.
  2. [Experiments] The preliminary LLM-style task is described only at a high level; adding concrete details on the user goals, shared constraints, and evaluation protocol would strengthen reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below. Where the comments identify areas for strengthening the claims regarding skill reusability and experimental rigor, we agree and will revise the manuscript accordingly.

  1. Referee: [Abstract and §3] Abstract and §3 (proposed method): The claim that extracted skills are reusable and task-independent is load-bearing for the hierarchical advantage over direct combination. The skeptic note correctly identifies that if crowd preferences entangle safety with task-specific objectives, the skill extraction step will not yield composable primitives and safety-cost reductions will not materialize. No analysis or ablation is described that tests this disentanglement (e.g., by measuring skill transfer across tasks with deliberately varied objectives).

    Authors: We agree that explicit validation of reusability and task-independence is important for supporting the hierarchical separation. The manuscript demonstrates transfer to downstream tasks with diverse objectives and shared safety constraints, but we acknowledge that dedicated ablations would more directly test disentanglement. In the revised version, we will add an ablation that evaluates the extracted skills on tasks with deliberately varied objectives, reporting safety-cost reductions and task performance to confirm that the skills function as composable, safety-focused primitives independent of specific task goals. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract states that the method 'substantially lowers safety costs' and achieves 'task performance comparable to oracle methods,' yet the reader's report notes the absence of reported baselines, ablation results, or quantitative metrics (specific safety-cost deltas, variance, or statistical tests). Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc design choices in the skill extractor or high-level policy.

    Authors: We thank the referee for highlighting the need for more detailed quantitative support. The experiments section already compares against direct reward combination and oracle baselines across safe RL environments and the preliminary LLM-style task. To improve verifiability, the revision will include explicit safety-cost deltas with standard deviations, variance reporting, and statistical tests. We will also add ablation results on the skill extractor and high-level policy to assess sensitivity to design choices. revision: yes

Circularity Check

0 steps flagged

No circularity: forward pipeline from preferences to skills to policy is independent of fitted outputs

full rationale

The paper describes a hierarchical framework that first extracts safety-aligned skills from crowd preference data and then composes them via a high-level policy to regularize downstream RL tasks. The abstract presents this as a forward derivation motivated by limitations of direct reward combination, with experiments claimed to show lowered safety costs. No equations, fitted parameters, or self-citations are quoted that would make any prediction or result equivalent to its inputs by construction. The central claim rests on the empirical transferability of extracted skills rather than a definitional loop or renamed fit. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that crowd preferences contain extractable, shared safety criteria that generalize across tasks; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • Domain assumption: Crowd preferences embed shared safety principles even when user objectives differ.
    Invoked in the abstract as the foundation for extracting safety-aligned skills from preference datasets.
invented entities (1)
  • Safety-aligned skills (no independent evidence)
    Purpose: Reusable lower-level behaviors that enforce safety when composed by a high-level policy.
    Introduced as the output of the preference-based extraction step in the proposed framework.

pith-pipeline@v0.9.0 · 5690 in / 1329 out tokens · 29650 ms · 2026-05-22T08:25:37.139264+00:00 · methodology

