hub Mixed citations

Jailbreak and guard aligned language models with only few in-context demonstrations

Mixed citation behavior. Most common role is background (60%).

21 Pith papers citing it
Background 60% of classified citations

citation-role summary

background: 3 · method: 2

representative citing papers

Secure LLM Fine-Tuning via Safety-Aware Probing

cs.LG · 2025-05-22 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

citing papers explorer

Showing 2 of 2 citing papers after filters.