Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets

2026 · arXiv 2506.05346

3 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Evaluation Game: Beyond Static LLM Benchmarking

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Benign fine-tuning drifts LLM parameters toward danger directions; SQSD scores each sample by the projection difference of its induced update onto safety versus danger vectors.

The Power of Order: Fooling LLMs with Adversarial Table Permutations

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.

citing papers explorer

Showing 3 of 3 citing papers.

The Evaluation Game: Beyond Static LLM Benchmarking

cs.LG · 2026-05-19 · unverdicted · none · ref 25

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

cs.AI · 2026-05-06 · unverdicted · none · ref 7

The Power of Order: Fooling LLMs with Adversarial Table Permutations

cs.LG · 2026-05-01 · unverdicted · none · ref 18 · 2 links
