Refusal in LLMs is an Affine Function

2024 · arXiv 2411.09003

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

cs.CR · 2026-05-17 · unverdicted · novelty 5.0

Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

cs.CL · 2026-05-03

citing papers explorer

Showing 3 of 3 citing papers.

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
cs.LG · 2026-05-03 · unverdicted · none · ref 7 · 2 links
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
cs.CR · 2026-05-17 · unverdicted · none · ref 23
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
cs.CL · 2026-05-03 · unreviewed · ref 40
