arXiv preprint arXiv:2411.09003

5 Pith papers cite this work. Polarity classification is still indexing.

years: 2026 (5)

representative citing papers

Expert-Aware Refusal Steering

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Refusal steering works on mixture-of-experts (MoE) LLMs; expert-aware variants succeed with single-expert outputs, and refusal signals differ from routing patterns.

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.
