RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3
The pith
RoTRAG retrieves human-written Rules of Thumb to ground LLM harm assessment in multi-turn dialogues, improving F1 by around 40 percent relative to competitive baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoTRAG is a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb, into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. A lightweight binary routing classifier decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context.
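A minimal sketch of that per-turn loop, to make the moving parts concrete. Every name here (`router`, `retrieve`, `judge`) is an illustrative stand-in, not the authors' interface:

```python
# Illustrative sketch of the RoTRAG per-turn loop described above.
# All interfaces are assumptions, not the authors' code: `router` is
# the binary routing classifier, `retrieve` returns top-k Rules of
# Thumb from the external corpus, and `judge` is the LLM assessor.

def assess_dialogue(turns, router, retrieve, judge):
    context, evidence, judgments = [], [], []
    for turn in turns:
        if router(turn, context):           # routing says: retrieve
            evidence = retrieve(turn, k=5)  # fresh normative evidence
        # Otherwise the previously retrieved RoTs are reused as-is.
        judgments.append(judge(turn, context, evidence))
        context.append(turn)
    return judgments  # per-turn reasoning feeding severity classification
```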
What carries the argument
Retrieval of relevant Rules of Thumb from an external corpus as explicit normative evidence, combined with a lightweight binary routing classifier that skips unnecessary retrievals.
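One common way to instantiate the retrieval step is dense embedding similarity over the RoT corpus; the abstract does not specify the retriever, so the sketch below is an assumption, with `encode` standing in for any sentence-embedding model:

```python
import numpy as np

def top_k_rots(turn_text, rot_texts, encode, k=5):
    """Rank Rules of Thumb by cosine similarity to the current turn."""
    q = encode([turn_text])[0]          # query vector, shape (d,)
    R = encode(rot_texts)               # corpus matrix, shape (n, d)
    sims = R @ q / (np.linalg.norm(R, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [(rot_texts[i], float(sims[i])) for i in top]
```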
If this is right
- Harm classification reaches an average relative F1 gain of around 40 percent across the benchmark datasets.
- Severity estimation records an average relative reduction of 8.4 percent in distributional error.
- Redundant computation falls without sacrificing overall performance.
- Judgments become more consistent in nuanced social contexts by direct reference to retrieved human norms.
Where Pith is reading between the lines
- The same retrieval-plus-routing pattern could be tested on other normative or safety-related language tasks outside harm detection.
- Performance would likely depend on how well the fixed Rules of Thumb corpus covers cultural or emerging social situations not present in the training benchmarks.
- Adding a mechanism to insert newly observed norms into the corpus could address coverage gaps that a static collection cannot handle.
- Real-world logs of live conversations would provide a stronger test than the current academic datasets for whether the efficiency gains hold under variable user behavior.
Load-bearing premise
A fixed external corpus of human-written Rules of Thumb supplies sufficiently complete, unbiased, and contextually relevant normative guidance for the full range of multi-turn conversational situations the system will encounter.
What would settle it
Running RoTRAG on a fresh collection of multi-turn dialogues that contain previously unseen social contexts and norms, then observing no gain or a loss in F1 and severity accuracy relative to non-retrieval baselines, would falsify the claimed benefit.
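A sketch of that comparison, assuming per-turn gold labels and predictions from RoTRAG and a non-retrieval baseline on the fresh dialogues (variable names are illustrative):

```python
from sklearn.metrics import f1_score

def relative_f1_gain(y_true, y_rotrag, y_baseline):
    # Macro-F1 is one reasonable choice; the abstract does not say
    # which averaging the reported ~40% relative gain uses.
    f1_r = f1_score(y_true, y_rotrag, average="macro")
    f1_b = f1_score(y_true, y_baseline, average="macro")
    return (f1_r - f1_b) / f1_b  # <= 0 on unseen norms would falsify the claim
```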
Original abstract
Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoTRAG, a retrieval-augmented framework for harm detection and severity estimation in multi-turn dialogues. It retrieves concise human-written Rules of Thumb (RoTs) from a fixed external corpus to ground LLM reasoning, and introduces a lightweight binary routing classifier to decide whether retrieval is needed for a new turn or prior context can be reused. Experiments on the ProsocialDialog and Safety Reasoning Multi Turn Dialogue benchmarks report an average ~40% relative F1 gain and 8.4% reduction in distributional error versus competitive baselines, while claiming reduced redundant computation.
Significance. If the performance claims hold under rigorous validation, the work could meaningfully advance conversational safety systems by shifting from purely parametric knowledge to explicit normative grounding, improving interpretability. The routing mechanism offers a practical efficiency contribution. However, the significance is limited by the untested assumption that a static RoT corpus supplies sufficiently complete guidance across diverse multi-turn scenarios; without evidence on coverage or failure modes, the gains may not generalize beyond the two evaluated datasets.
major comments (2)
- [Experimental Evaluation] The central claims of a 40% relative F1 gain and an 8.4% distributional-error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.
- [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open the possibility that the system reverts to base LLM behavior in novel situations.
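The coverage analysis the second comment asks for could start from something as simple as retrieval recall@k against annotated relevant RoTs; a minimal sketch, with all names hypothetical:

```python
def rot_recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of annotated-relevant RoTs recovered in the top-k."""
    if not relevant_ids:
        return 1.0  # nothing to recover; vacuously covered
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```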
minor comments (2)
- [Abstract] The abstract and methods refer to 'competitive baselines' without naming or citing them explicitly; this should be clarified with precise descriptions or references.
- [Results] The term 'distributional error' is used in the results but lacks a formal definition or equation; adding one would improve clarity (one plausible reading is sketched below).
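For illustration only, one plausible reading of "distributional error", assuming the system outputs a distribution $\hat{p}_i$ over $K$ severity levels and $p_i$ is the gold distribution, is the mean total variation distance; the paper may define it differently:

$$\mathrm{DE} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\sum_{k=1}^{K}\bigl|\hat{p}_i(k) - p_i(k)\bigr|$$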
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve the paper's rigor and clarity.
Point-by-point responses
Referee: [Experimental Evaluation] The central claims of a 40% relative F1 gain and an 8.4% distributional-error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.
Authors: We agree that the current presentation of aggregate results limits interpretability. In the revised manuscript, we will expand the Experimental Evaluation section to include per-dataset breakdowns for ProsocialDialog and Safety Reasoning Multi Turn Dialogue, 95% confidence intervals, statistical significance tests (e.g., paired t-tests or McNemar's test on F1 improvements, sketched after these responses), and a dedicated error analysis subsection that categorizes cases by conversation type and highlights where RoTRAG provides the largest gains over baselines. Revision: yes.
Referee: [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open the possibility that the system reverts to base LLM behavior in novel situations.
Authors: We acknowledge this as a valid concern regarding generalization. While the reported gains on both benchmarks indicate that retrieved RoTs provide useful normative grounding in the tested scenarios, we will add retrieval recall statistics and an ablation study on low-recall cases to the RoT Corpus and Retrieval subsection. We will also expand the discussion of limitations to explicitly address potential coverage gaps for novel multi-turn dynamics and note that the routing classifier is designed to fall back to parametric reasoning when retrieval is not triggered. Revision: partial.
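A minimal sketch of the paired test proposed in the first response, assuming boolean per-turn correctness arrays for RoTRAG and a baseline on the same examples; `mcnemar` from statsmodels is one standard implementation:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar(correct_a, correct_b):
    """Exact McNemar test on paired per-turn correctness indicators."""
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=True).pvalue
```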
Circularity Check
No circularity: framework grounded in external RoT corpus and evaluated on public benchmarks
Full rationale
The paper proposes RoTRAG as a retrieval-augmented system that retrieves human-written Rules of Thumb from a fixed external corpus and uses a lightweight routing classifier for efficiency. Performance claims (F1 gains and error reductions) are presented as empirical results on the public ProsocialDialog and Safety Reasoning Multi Turn Dialogue datasets. No equations, derivations, or self-citations in the abstract reduce the claimed improvements to quantities fitted or defined inside the paper itself. The central claims rest on external normative data and standard benchmark evaluation rather than self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can produce more consistent and interpretable harm judgments when supplied with retrieved normative rules than when relying solely on parametric knowledge.
invented entities (1)
- Binary routing classifier: no independent evidence