pith. machine review for the scientific record.

arxiv: 2604.17301 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.AI · cs.HC · cs.IR · cs.LG

Recognition: unknown

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC · cs.IR · cs.LG
keywords harm detection · retrieval-augmented generation · rules of thumb · multi-turn dialogue · conversation safety · LLM reasoning · normative guidance

The pith

RoTRAG retrieves human-written Rules of Thumb to ground LLM harm assessment in multi-turn dialogues, improving F1 by around 40 percent relative to competitive baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoTRAG, a retrieval-augmented framework that pulls concise human-written moral norms called Rules of Thumb from an external corpus to guide large language models in spotting harmful content during ongoing conversations. Current approaches often rely only on the model's internal knowledge, which produces inconsistent judgments in socially complex situations and repeats reasoning across turns. RoTRAG supplies explicit normative evidence for each turn's reasoning and final severity rating, while a lightweight binary classifier decides when retrieval is actually needed. Tests on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show higher accuracy in harm classification and severity estimation, along with less redundant computation, compared with competitive baselines.

Core claim

RoTRAG is a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb, into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. A lightweight binary routing classifier decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context.
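The per-turn control flow described above can be sketched as follows. This is a toy reconstruction from the paper's description, not the authors' code: the routing heuristic, the word-overlap retriever, and the keyword severity rule are placeholder stand-ins (in RoTRAG the router is a trained classifier and the judge is an LLM reasoning over the retrieved RoTs).

```python
from dataclasses import dataclass

@dataclass
class TurnResult:
    rots: list      # Rules of Thumb used as normative evidence
    severity: int   # e.g. 0 = safe, 1 = harmful (toy scale)

def route_needs_retrieval(turn: str, prior_rots: list) -> bool:
    """Stand-in for the lightweight binary routing classifier:
    retrieve only when the turn is mostly uncovered by cached RoTs
    (toy word-overlap heuristic, not the paper's trained router)."""
    covered = {w for rot in prior_rots for w in rot.lower().split()}
    words = set(turn.lower().split())
    return len(words - covered) > len(words) // 2

def retrieve_rots(turn: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank corpus RoTs by word overlap with the turn."""
    words = set(turn.lower().split())
    return sorted(corpus,
                  key=lambda r: -len(words & set(r.lower().split())))[:k]

def judge_turn(turn: str, corpus: list, prior_rots: list) -> TurnResult:
    if route_needs_retrieval(turn, prior_rots):
        rots = retrieve_rots(turn, corpus)
    else:
        rots = prior_rots  # reuse cached normative evidence, skip retrieval
    # In RoTRAG an LLM reasons over (turn, rots); here a keyword placeholder.
    severity = 1 if any(w in turn.lower() for w in ("hate", "hurt")) else 0
    return TurnResult(rots=rots, severity=severity)
```

The point of the structure is that retrieval cost is paid only on turns the router flags, while earlier turns' RoTs stay available as context for the rest.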

What carries the argument

Retrieval of relevant Rules of Thumb from an external corpus as explicit normative evidence, combined with a lightweight binary routing classifier that skips unnecessary retrievals.

If this is right

  • Harm classification reaches an average relative F1 gain of around 40 percent across the benchmark datasets.
  • Severity estimation records an average relative reduction of 8.4 percent in distributional error.
  • Redundant computation falls without any loss in overall performance.
  • Judgments become more consistent in nuanced social contexts by direct reference to retrieved human norms.
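Note that the headline numbers are relative, not absolute. A quick arithmetic check of what they mean, using illustrative baseline scores that are not from the paper:

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

# Illustrative only: if a baseline scored F1 = 0.50, a ~40% *relative*
# gain lands around F1 = 0.70, not at 0.90 (which a 40-point absolute
# gain would imply).
baseline_f1, rotrag_f1 = 0.50, 0.70
print(round(relative_gain(rotrag_f1, baseline_f1), 2))       # 0.4

# The severity claim runs the other way: an 8.4% relative *reduction*
# in distributional error (again, made-up absolute values).
baseline_err, rotrag_err = 0.250, 0.229
print(round((baseline_err - rotrag_err) / baseline_err, 3))  # 0.084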

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-routing pattern could be tested on other normative or safety-related language tasks outside harm detection.
  • Performance would likely depend on how well the fixed Rules of Thumb corpus covers cultural or emerging social situations not present in the training benchmarks.
  • Adding a mechanism to insert newly observed norms into the corpus could address coverage gaps that a static collection cannot handle.
  • Real-world logs of live conversations would provide a stronger test than the current academic datasets for whether the efficiency gains hold under variable user behavior.

Load-bearing premise

A fixed external corpus of human-written Rules of Thumb supplies sufficiently complete, unbiased, and contextually relevant normative guidance for the full range of multi-turn conversational situations the system will encounter.
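This premise is directly testable: one can estimate how often the fixed corpus actually covers incoming turns. A minimal sketch, using word-level Jaccard overlap as an assumed stand-in for the dense similarity a real retriever would compute:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a crude proxy for the embedding
    similarity an actual RoT retriever would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def corpus_coverage(turns, corpus, threshold=0.2):
    """Fraction of turns whose best-matching RoT clears `threshold`.
    A low value flags exactly the coverage gap the premise assumes away."""
    hits = sum(1 for t in turns
               if max(jaccard(t, r) for r in corpus) >= threshold)
    return hits / len(turns)
```

Run on a held-out stream of conversations, a coverage estimate like this would show whether the static corpus keeps up with cultural or emerging situations, or whether the system silently falls back to parametric judgment.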

What would settle it

Running RoTRAG on a fresh collection of multi-turn dialogues that contain previously unseen social contexts and norms, then observing no gain or a loss in F1 and severity accuracy relative to non-retrieval baselines, would falsify the claimed benefit.

Figures

Figures reproduced from arXiv:2604.17301 by Haihua Chen, Juhyeon Lee, Junseo Koh, Seunghyun Lee, Wonduk Seo, Yi Bu. Captions truncated in source.

  • Figure 1: Overview of RoTRAG. Illustrative example com…
  • Figure 2: Rule of Thumb Retrieval-Augmented Generation (RoTRAG). Given the previous Rule of Thumb (RoT), the current…
  • Figure 4: Average token usage of each method on the Proso…
  • Figure 5: Token-performance trade-off across methods on…
  • Figure 6: Average time required to generate one prediction…
  • Figure 7: Validation accuracy of five candidate models across…
  • Figure 8: Validation-set confusion matrix of the routing clas…
  • Figure 9: Prosocial case study illustrating how RoTRAG retrieves relevant normative evidence, generates a context-specific RoT…
  • Figure 10: Safety case study illustrating how RoTRAG retrieves manipulation-related normative evidence, generates a new RoT…
Original abstract

Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RoTRAG, a retrieval-augmented framework for harm detection and severity estimation in multi-turn dialogues. It retrieves concise human-written Rules of Thumb (RoTs) from a fixed external corpus to ground LLM reasoning, and introduces a lightweight binary routing classifier to decide whether retrieval is needed for a new turn or prior context can be reused. Experiments on the ProsocialDialog and Safety Reasoning Multi Turn Dialogue benchmarks report an average ~40% relative F1 gain and 8.4% reduction in distributional error versus competitive baselines, while claiming reduced redundant computation.

Significance. If the performance claims hold under rigorous validation, the work could meaningfully advance conversational safety systems by shifting from purely parametric knowledge to explicit normative grounding, improving interpretability. The routing mechanism offers a practical efficiency contribution. However, the significance is limited by the untested assumption that a static RoT corpus supplies sufficiently complete guidance across diverse multi-turn scenarios; without evidence on coverage or failure modes, the gains may not generalize beyond the two evaluated datasets.

major comments (2)
  1. [Experimental Evaluation] The central claims of 40% relative F1 gain and 8.4% distributional error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.
  2. [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open that the system may revert to base LLM behavior in novel situations.
minor comments (2)
  1. [Abstract] The abstract and methods refer to 'competitive baselines' without naming or citing them explicitly; this should be clarified with precise descriptions or references.
  2. [Results] The term 'distributional error' is used in results but lacks a formal definition or equation; adding one would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve the paper's rigor and clarity.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The central claims of 40% relative F1 gain and 8.4% distributional error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.

    Authors: We agree that the current presentation of aggregate results limits interpretability. In the revised manuscript, we will expand the Experimental Evaluation section to include per-dataset breakdowns for ProsocialDialog and Safety Reasoning Multi Turn Dialogue, 95% confidence intervals, statistical significance tests (e.g., paired t-tests or McNemar's test on F1 improvements), and a dedicated error analysis subsection that categorizes cases by conversation type and highlights where RoTRAG provides the largest gains versus baselines. revision: yes

  2. Referee: [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open that the system may revert to base LLM behavior in novel situations.

    Authors: We acknowledge this as a valid concern regarding generalization. While the reported gains on both benchmarks indicate that retrieved RoTs provide useful normative grounding in the tested scenarios, we will add retrieval recall statistics and an ablation study on low-recall cases to the RoT Corpus and Retrieval subsection. We will also expand the discussion of limitations to explicitly address potential coverage gaps for novel multi-turn dynamics and note that the routing classifier is designed to fall back to parametric reasoning when retrieval is not triggered. revision: partial
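The significance testing the rebuttal promises (e.g., McNemar's test on paired per-turn decisions) is cheap to run. A self-contained sketch of the exact two-sided version, operating on the discordant pairs where baseline and system disagree:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = baseline correct / system wrong,
    c = baseline wrong / system correct.
    Under the null, b ~ Binomial(b + c, 0.5); a small p-value
    means the two classifiers err on systematically different cases."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical counts, not from the paper: of the turns where the two
# systems disagree, the retrieval-grounded system wins 30 and loses 12.
print(mcnemar_exact(12, 30))
```

With counts this lopsided the p-value falls below 0.05; near-balanced counts (say 20 vs 22) would not, which is exactly the distinction the referee is asking the authors to report.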

Circularity Check

0 steps flagged

No circularity: framework grounded in external RoT corpus and evaluated on public benchmarks

Full rationale

The paper proposes RoTRAG as a retrieval-augmented system that retrieves human-written Rules of Thumb from a fixed external corpus and uses a lightweight routing classifier for efficiency. Performance claims (F1 gains and error reductions) are presented as empirical results on the public ProsocialDialog and Safety Reasoning Multi Turn Dialogue datasets. No equations, derivations, or self-citations in the abstract reduce the claimed improvements to quantities fitted or defined inside the paper itself. The central claims rest on external normative data and standard benchmark evaluation rather than self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the premise that external human-written normative rules can be retrieved and used effectively by LLMs; the binary router is an added component whose training details are not visible.

axioms (1)
  • domain assumption LLMs can produce more consistent and interpretable harm judgments when supplied with retrieved normative rules than when relying solely on parametric knowledge.
    This is the core justification for adding the retrieval step.
invented entities (1)
  • Binary routing classifier (no independent evidence)
    purpose: Decides for each turn whether retrieval of new RoTs is required or prior context suffices.
    Introduced to reduce redundant computation; no independent evidence of its accuracy is provided in the abstract.

pith-pipeline@v0.9.0 · 5526 in / 1422 out tokens · 53706 ms · 2026-05-10T06:52:27.859270+00:00 · methodology

discussion (0)


    When managing relationship dynamics, you should address your personal insecurities. 2.It's wrong to influence others' insecurities through manipulation and third-party involvement. Correspond RoT: RoT 1: It's wrong to play with people's emotions. RoT 2: It's wrong to manipulate your friends for your own benefit RoT 3: Sometimes you have to tell a lie to g...