HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.