Implicit Safety Alignment from Crowd Preferences

Daniel S. Brown; Qian Lin

arxiv: 2605.21822 · v1 · pith:652VCRXKnew · submitted 2026-05-20 · 💻 cs.AI


Pith reviewed 2026-05-22 08:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords safety alignment · crowd preferences · RLHF · hierarchical reinforcement learning · implicit safety · safe RL · preference-based learning


The pith

Shared safety principles can be extracted from crowd preferences and reused as composable skills to enforce safety in new RL tasks without explicit safety rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that reinforcement learning from human feedback on crowd preferences embeds common safety criteria even when users pursue different goals. These shared principles can be pulled out as reusable skills rather than mixed directly into task rewards. A hierarchical setup then lets a high-level policy combine those skills to solve downstream tasks safely. If the claim holds, agents can meet safety standards on new problems using only existing preference data instead of new safety annotations or rewards.

Core claim

A hierarchical framework extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to solve downstream RL tasks, lowering safety costs while matching the task performance of oracle methods that have direct access to ground-truth safety signals.

What carries the argument

The hierarchical Safe Crowd Preference-based RL framework that extracts safety-aligned skills from crowd preferences and composes them with a high-level policy.
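
To make the two-level structure concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the 1-D state, the hazard region |x| > 1, and the names `make_safe_skill`, `high_level_policy`, and `rollout` are all hypothetical, and the skills are hand-written rather than learned from preferences. It only illustrates how a high-level policy can pursue a task goal by choosing among low-level skills that each respect a shared constraint.

```python
from typing import Callable, Dict, Tuple

# Hypothetical toy setup: the state is a position x on a line,
# and the shared safety constraint is |x| <= 1.
Skill = Callable[[float], float]  # low-level policy: state -> action (a displacement)

HAZARD_LIMIT = 1.0
TOLERANCE = 1e-9  # guards the violation check against float rounding


def make_safe_skill(direction: float, limit: float = HAZARD_LIMIT) -> Skill:
    """Toy safety-aligned skill: step in `direction`, but never leave [-limit, limit]."""
    def skill(x: float) -> float:
        target = max(-limit, min(limit, x + 0.1 * direction))
        return target - x
    return skill


def high_level_policy(x: float, goal: float) -> str:
    """Toy high-level policy: pick the skill that moves toward the task goal."""
    return "right" if goal > x else "left"


def rollout(skills: Dict[str, Skill], goal: float, steps: int = 50) -> Tuple[float, int]:
    """Run one episode and count steps spent inside the hazard region."""
    x, violations = 0.0, 0
    for _ in range(steps):
        name = high_level_policy(x, goal)
        x += skills[name](x)
        violations += int(abs(x) > HAZARD_LIMIT + TOLERANCE)
    return x, violations


skills = {"right": make_safe_skill(+1.0), "left": make_safe_skill(-1.0)}
final_x, violations = rollout(skills, goal=2.0)
print(final_x, violations)
```

Because the constraint lives inside the skills, the high-level policy can chase a goal that lies in the hazard region (here `goal=2.0`) and the rollout still stops at the boundary.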

If this is right

Directly combining a preference-learned reward model with task rewards has inherent limitations for safety.
Extracting and composing safety skills from crowd data achieves safety performance close to methods given explicit safety rewards.
The approach works across standard safe RL benchmarks and a preliminary LLM-style task with varied user goals.
Safety regularization becomes possible without access to separate safety reward functions.
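
The first point above can be illustrated with a small numeric sketch. It assumes RC(ω) is a convex combination of a task reward and a preference-learned reward; this summary does not give the exact formula, and the three candidate behaviors and their reward values are invented for illustration. The preference reward is deliberately biased toward a majority style, echoing the imbalance effect described for Figure 1.

```python
from typing import Dict, Tuple


def combined_reward(r_task: float, r_pref: float, omega: float) -> float:
    """Assumed form of direct reward combination: (1 - omega) * task + omega * preference."""
    return (1.0 - omega) * r_task + omega * r_pref


# Hypothetical candidates: (task reward, preference-model reward).
candidates: Dict[str, Tuple[float, float]] = {
    "fast_unsafe": (1.0, 0.0),          # best on task, disliked by the crowd
    "safe_on_task": (0.8, 0.6),         # safe and still good on the new task
    "safe_majority_style": (0.2, 1.0),  # over-rewarded majority behavior, poor on task
}


def best_behavior(omega: float) -> str:
    """Return the candidate that maximizes the combined reward for a given omega."""
    return max(candidates, key=lambda name: combined_reward(*candidates[name], omega))


for omega in (0.0, 0.5, 0.9):
    print(omega, best_behavior(omega))
```

In this toy setup a small ω selects the unsafe behavior and a large ω selects the majority-styled behavior that does poorly on the task. Only a middle range recovers the safe, on-task option, so the outcome hinges on tuning ω.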

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the extracted skills prove stable, they could be stored in a library and reused across many downstream tasks rather than re-learned each time.
The same extraction step might surface other implicit constraints, such as fairness or resource limits, from the same preference datasets.
Testing whether the learned skills remain effective when user goals shift more dramatically than in the reported experiments would clarify the limits of transfer.

Load-bearing premise

Users with different goals still follow similar underlying safety principles that can be isolated as reusable skills and transferred across tasks.

What would settle it

A controlled experiment on a new safe RL environment in which the hierarchical method produces higher cumulative safety violations than a direct reward-combination baseline or a non-hierarchical preference model.

Figures

Figures reproduced from arXiv: 2605.21822 by Daniel S. Brown, Qian Lin.

**Figure 1.** Predicted rewards on safe and unsafe samples under the imbalanced crowd-preference setting. Safe trajectories (green) receive higher rewards than unsafe ones (red), while trajectories aligned with the majority preference (blue) are over-rewarded compared to minority-preferred ones (yellow). … from a task z_k exceeds this threshold, the learned model's preference ordering aligns completely with u(τ, z_k). Cons…

**Figure 2.** Performance points of different algorithms and RC(ω) with various weights ω under balanced and imbalanced preference settings. The corresponding numerical results are reported in …

**Figure 3.** Performance on different downstream tasks under imbalanced settings; the x-axis corresponds to task IDs. RC either incurs high safety cost or performs well only on a subset of tasks, whereas our method remains robust across tasks. … noticeably away from the Oracle, consistent with the conclusion in Section 4.2 that preference imbalance induces bias in the learned reward, which in turn misguides downstream …

**Figure 4.** Performance under different regularization weights β_reg, noise ratios, and crowd sizes. Dashed lines denote R_norm(Oracle) in the first row and C_norm(Task-Only) in the second. Our method is robust to β_reg and crowd size; while preference noise degrades safety performance, task reward remains largely unaffected. Adjacent table (truncated): Reward: Task-only 0.95 ± .01, RC(ω = 0.25) 0.94 ± .01, RC(ω = 0.75) 0.50 ± .01, Ours 0.75 ± .01; Cost: 0.…

**Figure 5.** Visualization of the six downstream environments considered in this work. In the left three panels, the red dot denotes the agent; wall constraints are present in the first and third panels, while the blue region in the second panel indicates a risky area. Reach: At the beginning of each episode, six hazardous regions are randomly generated. The agent is required to navigate to a target location while avoid…

**Figure 6.** Performance fronts of different algorithms and RC with various trade-off weights ω using balanced and imbalanced crowd preference datasets under the online downstream setting. Panel titles recovered from adjacent plot residue: Reach, Run, Circle, Ant-vel, Swimmer-vel, HalfCheetah-v…

**Figure 7.** Performance curves of different algorithms during the first 0.5M training steps under online downstream settings. D.4. Ablations on Additional Settings. Number of preference sets N_pref and size of each preference set Sz: We further analyze the sensitivity of our method to the number of preference sets N_pref and the size of each preference set Sz. As shown in …

**Figure 8.** Performance of our methods under different prior regularization weights β_reg ∈ [0.0, 0.01, 0.05, 0.1, 1.0, 10.0] in Eq. 11. Adjacent plot residue: panels Circle and Swimmer-vel; x-axis Flipping Prob; y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.

**Figure 9.** Performance of our methods under different preference noise ratios. D.5. Dense and Sparse Annotation for Crowd Preference Data. In the main experiments, we adopt a dense annotation setting, where the same N_pref trajectory sets are labeled under all M annotation tasks, resulting in N_pref × M preference sets in total. In this subsection, we additionally consider a sparse annotation setting, where for each anno…

**Figure 10.** Performance of our methods under different crowd sizes. Adjacent plot residue: panels Circle and Swimmer-vel; x-axis Number of preference sets (250, 500, 1000, 2500, 5000); y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.

**Figure 11.** Performance of our methods under different numbers of preference sets N_pref. E. LLM Evaluation Details. Task Description: The environment consists of one query, “Please talk about one kind of pets”, together with three categories of responses: bird-related, dog-related, and cat-related responses. Each category contains both safe responses and unsafe responses. Safe responses consist only of normal pet-relate…

**Figure 12.** Performance of our methods under different sizes of the preference set Sz. Adjacent plot residue: panels Circle and Swimmer-vel; x-axis Training Step (1e3 to 5e5); y-axes Norm Reward and Norm Cost; legend Safe-VPL, Safe-CPL.

**Figure 13.** Performance of our methods under different low-level policy training steps. … or imitating a single user cannot maximize downstream reward. Instead, successful performance requires composing crowd preferences and generalizing beyond those directly observed in the crowd data. Importantly, downstream training only provides the safety-agnostic reward r_new, without access to the shared safety objective. As a re…
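
The noise ablations referenced in Figures 4 and 9 vary how often preference labels are corrupted, and the plot residue labels the axis "Flipping Prob". A minimal sketch of such label flipping follows, assuming binary preference labels that are flipped independently with a fixed probability. The function name `flip_labels` and the seeding scheme are hypothetical, and the paper's actual noise model may differ.

```python
import random
from typing import List


def flip_labels(labels: List[int], flip_prob: float, seed: int = 0) -> List[int]:
    """Flip each binary preference label (0 or 1) independently with probability flip_prob."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [1 - y if rng.random() < flip_prob else y for y in labels]


clean = [1, 0, 1, 1, 0]
print(flip_labels(clean, flip_prob=0.0))  # no noise: labels unchanged
print(flip_labels(clean, flip_prob=1.0))  # full noise: every label inverted
```

Sweeping `flip_prob` over a grid and retraining at each value is one way to produce a robustness curve of the kind the captions describe.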

Original abstract

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.
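
For readers unfamiliar with how a "preference-learned reward model" is usually fitted, the sketch below shows the Bradley-Terry likelihood that is common in preference-based RL. The abstract does not state which preference model the paper uses, so this is an assumption for illustration; the function names and the example returns are hypothetical.

```python
import math
from typing import List, Tuple


def bt_prob(return_a: float, return_b: float) -> float:
    """Bradley-Terry probability that trajectory a is preferred over trajectory b."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


def bt_loss(pairs: List[Tuple[float, float, int]]) -> float:
    """Mean negative log-likelihood over (return_a, return_b, label) triples.

    label is 1 if the annotator preferred a, and 0 if they preferred b.
    The returns stand in for the summed outputs of a reward model.
    """
    total = 0.0
    for return_a, return_b, label in pairs:
        p = bt_prob(return_a, return_b)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(pairs)


# A reward model that agrees with the annotator incurs a lower loss.
agree = [(2.0, 0.0, 1)]
disagree = [(0.0, 2.0, 1)]
print(bt_loss(agree), bt_loss(disagree))
```

Minimizing this loss over a reward model's parameters pushes preferred trajectories toward higher predicted return; with crowd data, each annotator's task objective and the shared safety criteria both shape the labels.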

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The hierarchical extraction of reusable safety skills from crowd preferences is a reasonable practical step but rests on an assumption of clean separability that needs more checking.


The main thing to know is that this paper proposes pulling shared safety principles out of crowd preference data as reusable skills, then composing them via a high-level policy to keep downstream RL agents safer without explicit safety rewards. The authors motivate it by showing that simply mixing a preference-learned reward with task rewards runs into limitations, which is a fair observation when objectives conflict. The framework itself is a straightforward hierarchical split that tries to reuse the safety overlap across users who otherwise want different things.

The experiments, run in safe RL environments and a preliminary LLM-style task, report lower safety costs while matching the task performance of oracle methods that have ground-truth safety access. That outcome, if it holds, would be useful for anyone trying to bootstrap safety from existing RLHF datasets rather than designing rewards from scratch. What they do well is keep the focus on a real deployment bottleneck and test the idea in more than one setting.

The soft spot is the transferability claim. The approach assumes crowd preferences contain reusable, task-independent safety principles that the extraction step can isolate cleanly enough for composition. If those preferences actually entangle safety with specific user goals, or if the skill layer does not produce composable pieces, the gains over direct reward combination could disappear. The abstract flags the direct method's problems but does not yet show side-by-side numbers proving the hierarchy avoids the same entanglement. More detail on the extraction algorithm and ablations would help judge whether the reported drops are robust.

This is for researchers working on safe RL and preference-based alignment who want to leverage crowd data without extra safety labeling. A reader focused on practical ways to regularize agent behavior would get something out of the framework and the reported results.

It has enough of a concrete proposal and experimental sketch to deserve a serious referee. I would send it for peer review, with the expectation that reviewers press on the separability assumption and the strength of the comparisons.

Referee Report

2 major / 2 minor

Summary. The paper introduces Safe Crowd Preference-based RL, a hierarchical framework that extracts reusable safety-aligned skills from crowd preference data (where users share safety principles despite distinct objectives) and composes them via a high-level policy to regularize downstream RL agents. It first identifies limitations of direct reward combination between a preference-learned reward model and task rewards, then demonstrates through experiments in safe RL environments and a preliminary LLM-style task that the approach reduces safety costs without explicit safety rewards while matching oracle performance.

Significance. If the central results hold, the work provides a concrete mechanism for implicit safety alignment by isolating shared safety criteria as composable skills, addressing a practical gap in RLHF where explicit safety signals are costly or unavailable. The hierarchical separation of safety from task objectives could improve generalization and reduce entanglement compared to flat reward models.

major comments (2)

[Abstract and §3] Abstract and §3 (proposed method): The claim that extracted skills are reusable and task-independent is load-bearing for the hierarchical advantage over direct combination. The skeptic note correctly identifies that if crowd preferences entangle safety with task-specific objectives, the skill extraction step will not yield composable primitives and safety-cost reductions will not materialize. No analysis or ablation is described that tests this disentanglement (e.g., by measuring skill transfer across tasks with deliberately varied objectives).
[Experiments] Experiments section: The abstract states that the method 'substantially lowers safety costs' and achieves 'task performance comparable to oracle methods,' yet the reader's report notes the absence of reported baselines, ablation results, or quantitative metrics (specific safety-cost deltas, variance, or statistical tests). Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc design choices in the skill extractor or high-level policy.

minor comments (2)

[§3] Clarify the precise definition of 'safety-aligned skills' and the objective used to extract them from the preference dataset; the current description leaves open whether the extraction step implicitly uses safety labels or purely unsupervised decomposition.
[Experiments] The preliminary LLM-style task is described only at a high level; adding concrete details on the user goals, shared constraints, and evaluation protocol would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below. Where the comments identify areas for strengthening the claims regarding skill reusability and experimental rigor, we agree and will revise the manuscript accordingly.


Referee: [Abstract and §3] Abstract and §3 (proposed method): The claim that extracted skills are reusable and task-independent is load-bearing for the hierarchical advantage over direct combination. The skeptic note correctly identifies that if crowd preferences entangle safety with task-specific objectives, the skill extraction step will not yield composable primitives and safety-cost reductions will not materialize. No analysis or ablation is described that tests this disentanglement (e.g., by measuring skill transfer across tasks with deliberately varied objectives).

Authors: We agree that explicit validation of reusability and task-independence is important for supporting the hierarchical separation. The manuscript demonstrates transfer to downstream tasks with diverse objectives and shared safety constraints, but we acknowledge that dedicated ablations would more directly test disentanglement. In the revised version, we will add an ablation that evaluates the extracted skills on tasks with deliberately varied objectives, reporting safety-cost reductions and task performance to confirm that the skills function as composable, safety-focused primitives independent of specific task goals. revision: yes
Referee: [Experiments] Experiments section: The abstract states that the method 'substantially lowers safety costs' and achieves 'task performance comparable to oracle methods,' yet the reader's report notes the absence of reported baselines, ablation results, or quantitative metrics (specific safety-cost deltas, variance, or statistical tests). Without these, it is impossible to verify whether the reported gains are robust or sensitive to post-hoc design choices in the skill extractor or high-level policy.

Authors: We thank the referee for highlighting the need for more detailed quantitative support. The experiments section already compares against direct reward combination and oracle baselines across safe RL environments and the preliminary LLM-style task. To improve verifiability, the revision will include explicit safety-cost deltas with standard deviations, variance reporting, and statistical tests. We will also add ablation results on the skill extractor and high-level policy to assess sensitivity to design choices. revision: yes
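
The kind of reporting promised here (safety-cost deltas, standard deviations, and a statistical test) can be sketched with the Python standard library. The per-seed cost values below are invented placeholders, not results from the paper, and `cost_delta_report` is a hypothetical helper. It computes Welch's t statistic only; a p-value would additionally need the t distribution, for example via SciPy's `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
import statistics
from typing import Sequence, Tuple


def cost_delta_report(
    ours: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Return (mean cost delta, std of ours, std of baseline, Welch t statistic).

    Each input holds one safety-cost value per random seed.
    """
    mean_o, mean_b = statistics.mean(ours), statistics.mean(baseline)
    var_o, var_b = statistics.variance(ours), statistics.variance(baseline)  # sample variances
    std_err = math.sqrt(var_o / len(ours) + var_b / len(baseline))
    t_stat = (mean_o - mean_b) / std_err
    return mean_o - mean_b, math.sqrt(var_o), math.sqrt(var_b), t_stat


# Placeholder per-seed safety costs (hypothetical numbers).
ours = [1.0, 2.0, 3.0]
baseline = [2.0, 3.0, 4.0]
delta, std_ours, std_base, t_stat = cost_delta_report(ours, baseline)
print(f"delta={delta:.2f}, std_ours={std_ours:.2f}, std_base={std_base:.2f}, t={t_stat:.3f}")
```

A negative delta means the evaluated method incurs lower safety cost than the baseline; the magnitude of the t statistic indicates how large that gap is relative to seed-to-seed variation.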

Circularity Check

0 steps flagged

No circularity: forward pipeline from preferences to skills to policy is independent of fitted outputs


The paper describes a hierarchical framework that first extracts safety-aligned skills from crowd preference data and then composes them via a high-level policy to regularize downstream RL tasks. The abstract presents this as a forward derivation motivated by limitations of direct reward combination, with experiments claimed to show lowered safety costs. No equations, fitted parameters, or self-citations are quoted that would make any prediction or result equivalent to its inputs by construction. The central claim rests on the empirical transferability of extracted skills rather than a definitional loop or renamed fit. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that crowd preferences contain extractable, shared safety criteria that generalize across tasks; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Crowd preferences embed shared safety principles even when user objectives differ.
Invoked in the abstract as the foundation for extracting safety-aligned skills from preference datasets.

invented entities (1)

safety-aligned skills (no independent evidence)
purpose: Reusable lower-level behaviors that enforce safety when composed by a high-level policy.
Introduced as the output of the preference-based extraction step in the proposed framework.



