Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection

Seal: Safety-enhanced aligned llm finetuning via bilevel data selection · 2024 · arXiv 2410.07471

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.

BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

Defending Against Harmful Supervision Hidden in Benign Samples

cs.CR · 2026-06-29 · unverdicted · novelty 5.0

The paper proposes Dual-Reference SFT (DR-SFT) to defend LLMs against harmful QA pairs embedded in benign training samples, where existing guardrails fail at the example level.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 4 of 4 citing papers.

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection cs.LG · 2026-05-27 · unverdicted · none · ref 45
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 50 · 2 links
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
Defending Against Harmful Supervision Hidden in Benign Samples cs.CR · 2026-06-29 · unverdicted · none · ref 5
The paper proposes Dual-Reference SFT (DR-SFT) to defend LLMs against harmful QA pairs embedded in benign training samples, where existing guardrails fail at the example level.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 133
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer