SafeDPO: A simple approach to direct preference optimization with enhanced safety

2025 · arXiv 2505.20065

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Distributed Direct Preference Optimization

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

First convergence analysis of DPO under federated and decentralized training, characterizing rates via client drift, communication frequency, preference heterogeneity, and graph spectral connectivity.

Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces

cs.CR · 2026-07-01 · conditional · novelty 6.0

SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

Grammar-constrained decoding enables a new jailbreak (CodeSpear) on LLMs for malicious code, countered by CodeShield which trains models to output harmless honeypot code under GCD while preserving refusals.

Selective Safety Steering via Value-Filtered Decoding

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

cs.LG · 2026-06-04 · unverdicted · novelty 5.0

DOG-DPO selects 11% of preference pairs via geometric subspace decomposition to recover most safety gains of full-data DPO training across six benchmarks.

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

PREFINE adapts Direct Preference Optimization to trajectory-level preferences in RL for joint reward retention and safety alignment in continuous domains.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
cs.CL · 2026-05-30 · unverdicted · none · ref 52

Distributed Direct Preference Optimization
cs.LG · 2026-05-20 · unverdicted · none · ref 6

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
cs.CR · 2026-06-10 · unverdicted · none · ref 44

Selective Safety Steering via Value-Filtered Decoding
cs.LG · 2026-05-14 · unverdicted · none · ref 10

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
cs.LG · 2026-04-22 · unverdicted · none · ref 81

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR · 2026-04-09 · unverdicted · none · ref 33

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
cs.LG · 2026-06-04 · unverdicted · none · ref 3

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment
cs.LG · 2026-05-20 · unverdicted · none · ref 11
