Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Advances in Neural Information Processing Systems , volume=
9 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 9representative citing papers
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
citing papers explorer
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
VoiceBench: Benchmarking LLM-Based Voice Assistants
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
-
Common-agency Games for Multi-Objective Test-Time Alignment
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
-
Understanding Annotator Safety Policy with Interpretability
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
-
AlignCultura: Towards Culturally Aligned Large Language Models?
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
-
Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.