arXiv preprint arXiv:2404.01295

Towards safety, helpfulness balanced responses via controllable large language models · 2024 · arXiv 2404.01295

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Evolving and Detecting Multi-Turn Deception using Geometric Signatures

stat.ML · 2026-05-26 · unverdicted · novelty 6.0

Multi-objective genetic prompt optimization creates multi-turn deceptive datasets validated by humans, then detected with 0.89 recall using angular coverage, distance ratio, and linearity features in embeddings.

citing papers explorer

Showing 2 of 2 citing papers.

Addressing Over-Refusal in LLMs with Competing Rewards · cs.LG · 2026-06-30 · unverdicted · none · ref 80
Evolving and Detecting Multi-Turn Deception using Geometric Signatures · stat.ML · 2026-05-26 · unverdicted · none · ref 2
