
Refusal in LLMs is an affine function

3 Pith papers cite this work. Polarity classification is still indexing.


Citing papers by year: 2026 (3)

representative citing papers

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.
