DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.
Safeinfer: Context adaptive decoding time safety alignment for large language models.Proceedings of the AAAI Conference on Artificial Intelligence, 39(26):27188–27196, April 2025
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.
citing papers explorer
-
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.