SafeDPO: A simple approach to direct preference optimization with enhanced safety
9 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (9)
roles: background (1)
polarities: background (1)
representative citing papers
SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.
citing papers explorer
- Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
  SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
- Distributed Direct Preference Optimization
  First convergence analysis of DPO under federated and decentralized training, characterizing rates via client drift, communication frequency, preference heterogeneity, and graph spectral connectivity.
- Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
  Grammar-constrained decoding enables a new jailbreak (CodeSpear) on LLMs for malicious code, countered by CodeShield which trains models to output harmless honeypot code under GCD while preserving refusals.
- Selective Safety Steering via Value-Filtered Decoding
  Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.
- MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
  MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
  DOG-DPO selects 11% of preference pairs via geometric subspace decomposition to recover most safety gains of full-data DPO training across six benchmarks.
- PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment
  PREFINE adapts Direct Preference Optimization to trajectory-level preferences in RL for joint reward retention and safety alignment in continuous domains.