Mixed citations

arXiv preprint arXiv:2310.02949

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models · 2023 · arXiv 2310.02949

Mixed citation behavior. Most common role is background (60%).

35 Pith papers citing it

Background 60% of classified citations


citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 3 · support: 2

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ALIGNBEAM transfers safety alignment across LLMs with different vocabularies at inference time via cross-vocabulary logit mixing and judge-based selection.

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Video MLLMs show higher jailbreak rates with multi-clip videos than images or static videos, with success increasing alongside clip count and contextual diversity.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

Optimizer choice during LLM fine-tuning produces up to 7x variation in emergent misalignment rates, with spectral regularization on LoRA adapters substantially mitigating misalignment for prone optimizers.

Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

Skin-Deep extracts a Geometric Fragility Score from LLM activations that identifies which initially safe models retain the most refusal after small LoRA fine-tuning.

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

Safety Reflection Pretraining adds regular safety reflections to pretraining data to integrate self-monitoring and reduce unsafe generalization from safe data in LLMs.

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.

CSULoRA: Closest Safe Update Low-Rank Adaptation

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

CSULoRA decomposes LoRA updates into fully aligned, partially aligned, and off-subspace components and solves a closed-form penalized minimum-change problem to preserve safe parts while attenuating unsafe directions.

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

cs.LG · 2026-05-12 · conditional · novelty 6.0

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

cs.CR · 2026-05-09 · unverdicted · novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

cs.CY · 2026-04-27 · unverdicted · novelty 6.0

Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.

Representation-Guided Parameter-Efficient LLM Unlearning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

Continual Safety Alignment via Gradient-Based Sample Selection

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

Gradient-based selection that drops high-gradient samples during continual fine-tuning preserves safety alignment in LLMs better than standard fine-tuning while keeping task performance competitive.

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

Learning to Ask: When LLM Agents Meet Unclear Instruction

cs.CL · 2024-08-31 · unverdicted · novelty 6.0

Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

Defending Against Harmful Supervision Hidden in Benign Samples

cs.CR · 2026-06-29 · unverdicted · novelty 5.0

The paper proposes Dual-Reference SFT (DR-SFT) to defend LLMs against harmful QA pairs embedded in benign training samples, where existing guardrails fail at the example level.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

cs.LG · 2026-05-12 · conditional · none · ref 22

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

cs.LG · 2026-05-26 · conditional · none · ref 17

Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.
