arXiv preprint arXiv:2404.01295
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Addressing Over-Refusal in LLMs with Competing Rewards
  SEAR trains a single LLM with adversarial process rewards to explore harmful reasoning paths and then flip to safe outputs, reducing over-refusal while preserving safety.
- Evolving and Detecting Multi-Turn Deception using Geometric Signatures
  Multi-objective genetic prompt optimization creates multi-turn deceptive datasets that are validated by humans; the deception is then detected with 0.89 recall using angular coverage, distance ratio, and linearity features computed on embeddings.