arXiv preprint arXiv:2411.09003

Refusal in LLMs is an Affine Function · 2024 · arXiv 2411.09003

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

Geometric Unlearning distills a low-rank safe subspace from reference prompts and applies projection-based alignment on synthetic anchors to suppress target content while preserving non-target utility.

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

cs.CR · 2026-05-17 · unverdicted · novelty 5.0

Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.

citing papers explorer

Low-Resource Safety Failures Are Action Failures, Not Representation Failures · cs.CL · 2026-05-31 · conditional · none · ref 33
Expert-Aware Refusal Steering · cs.CL · 2026-06-02 · unverdicted · none · ref 8
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs · cs.LG · 2026-05-03 · unverdicted · none · ref 7 · 2 links
Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure · cs.CL · 2026-05-03 · unverdicted · none · ref 40
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications · cs.CR · 2026-05-17 · unverdicted · none · ref 23
