pith. machine review for the scientific record.

arxiv: 2604.17301 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.AI · cs.HC · cs.IR · cs.LG

Recognition: unknown

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC · cs.IR · cs.LG
keywords harm detection · retrieval-augmented generation · rules of thumb · multi-turn dialogue · conversation safety · LLM reasoning · normative guidance

The pith

RoTRAG retrieves human-written Rules of Thumb to ground LLM harm assessment in multi-turn dialogues, improving F1 by around 40 percent relative to competitive baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoTRAG, a retrieval-augmented framework that pulls concise human-written moral norms called Rules of Thumb from an external corpus to guide large language models in spotting harmful content during ongoing conversations. Current approaches often rely only on the model's internal knowledge, which produces inconsistent judgments in socially complex situations and repeats reasoning across turns. RoTRAG supplies explicit normative evidence for each turn's reasoning and final severity rating, while a lightweight binary classifier decides when retrieval is actually needed. Tests on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show higher accuracy in harm classification and severity estimation, along with less redundant computation, compared with competitive baselines.

Core claim

RoTRAG is a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb, into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. A lightweight binary routing classifier decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context.
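The per-turn control flow described above can be sketched as follows. This is a toy reconstruction from the paper's description, not the authors' code: the routing heuristic, the word-overlap retriever, and the keyword severity rule are placeholder stand-ins (in RoTRAG the router is a trained classifier and the judge is an LLM reasoning over the retrieved RoTs).

```python
from dataclasses import dataclass

@dataclass
class TurnResult:
    rots: list      # Rules of Thumb used as normative evidence
    severity: int   # e.g. 0 = safe, 1 = harmful (toy scale)

def route_needs_retrieval(turn: str, prior_rots: list) -> bool:
    """Stand-in for the lightweight binary routing classifier:
    retrieve only when the turn is mostly uncovered by cached RoTs
    (toy word-overlap heuristic, not the paper's trained router)."""
    covered = {w for rot in prior_rots for w in rot.lower().split()}
    words = set(turn.lower().split())
    return len(words - covered) > len(words) // 2

def retrieve_rots(turn: str, corpus: list, k: int = 2) -> list:
    """Toy retriever: rank corpus RoTs by word overlap with the turn."""
    words = set(turn.lower().split())
    return sorted(corpus,
                  key=lambda r: -len(words & set(r.lower().split())))[:k]

def judge_turn(turn: str, corpus: list, prior_rots: list) -> TurnResult:
    if route_needs_retrieval(turn, prior_rots):
        rots = retrieve_rots(turn, corpus)
    else:
        rots = prior_rots  # reuse cached normative evidence, skip retrieval
    # In RoTRAG an LLM reasons over (turn, rots); here a keyword placeholder.
    severity = 1 if any(w in turn.lower() for w in ("hate", "hurt")) else 0
    return TurnResult(rots=rots, severity=severity)
```

The point of the structure is that retrieval cost is paid only on turns the router flags, while earlier turns' RoTs stay available as context for the rest.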

What carries the argument

Retrieval of relevant Rules of Thumb from an external corpus as explicit normative evidence, combined with a lightweight binary routing classifier that skips unnecessary retrievals.

If this is right

  • Harm classification reaches an average relative F1 gain of around 40 percent across the benchmark datasets.
  • Severity estimation records an average relative reduction of 8.4 percent in distributional error.
  • Redundant computation falls without any loss in overall performance.
  • Judgments become more consistent in nuanced social contexts by direct reference to retrieved human norms.
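Note that the headline numbers are relative, not absolute. A quick arithmetic check of what they mean, using illustrative baseline scores that are not from the paper:

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

# Illustrative only: if a baseline scored F1 = 0.50, a ~40% *relative*
# gain lands around F1 = 0.70, not at 0.90 (which a 40-point absolute
# gain would imply).
baseline_f1, rotrag_f1 = 0.50, 0.70
print(round(relative_gain(rotrag_f1, baseline_f1), 2))       # 0.4

# The severity claim runs the other way: an 8.4% relative *reduction*
# in distributional error (again, made-up absolute values).
baseline_err, rotrag_err = 0.250, 0.229
print(round((baseline_err - rotrag_err) / baseline_err, 3))  # 0.084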

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-routing pattern could be tested on other normative or safety-related language tasks outside harm detection.
  • Performance would likely depend on how well the fixed Rules of Thumb corpus covers cultural or emerging social situations not present in the training benchmarks.
  • Adding a mechanism to insert newly observed norms into the corpus could address coverage gaps that a static collection cannot handle.
  • Real-world logs of live conversations would provide a stronger test than the current academic datasets for whether the efficiency gains hold under variable user behavior.

Load-bearing premise

A fixed external corpus of human-written Rules of Thumb supplies sufficiently complete, unbiased, and contextually relevant normative guidance for the full range of multi-turn conversational situations the system will encounter.
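This premise is directly testable: one can estimate how often the fixed corpus actually covers incoming turns. A minimal sketch, using word-level Jaccard overlap as an assumed stand-in for the dense similarity a real retriever would compute:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a crude proxy for the embedding
    similarity an actual RoT retriever would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def corpus_coverage(turns, corpus, threshold=0.2):
    """Fraction of turns whose best-matching RoT clears `threshold`.
    A low value flags exactly the coverage gap the premise assumes away."""
    hits = sum(1 for t in turns
               if max(jaccard(t, r) for r in corpus) >= threshold)
    return hits / len(turns)
```

Run on a held-out stream of conversations, a coverage estimate like this would show whether the static corpus keeps up with cultural or emerging situations, or whether the system silently falls back to parametric judgment.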

What would settle it

Running RoTRAG on a fresh collection of multi-turn dialogues that contain previously unseen social contexts and norms, then observing no gain or a loss in F1 and severity accuracy relative to non-retrieval baselines, would falsify the claimed benefit.

Figures

Figures reproduced from arXiv:2604.17301 by Haihua Chen, Juhyeon Lee, Junseo Koh, Seunghyun Lee, Wonduk Seo, Yi Bu. Captions truncated in source.

  • Figure 1: Overview of RoTRAG. Illustrative example com…
  • Figure 2: Rule of Thumb Retrieval-Augmented Generation (RoTRAG). Given the previous Rule of Thumb (RoT), the current…
  • Figure 4: Average token usage of each method on the Proso…
  • Figure 5: Token-performance trade-off across methods on…
  • Figure 6: Average time required to generate one prediction…
  • Figure 7: Validation accuracy of five candidate models across…
  • Figure 8: Validation-set confusion matrix of the routing clas…
  • Figure 9: Prosocial case study illustrating how RoTRAG retrieves relevant normative evidence, generates a context-specific RoT…
  • Figure 10: Safety case study illustrating how RoTRAG retrieves manipulation-related normative evidence, generates a new RoT…
Original abstract

Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RoTRAG, a retrieval-augmented framework for harm detection and severity estimation in multi-turn dialogues. It retrieves concise human-written Rules of Thumb (RoTs) from a fixed external corpus to ground LLM reasoning, and introduces a lightweight binary routing classifier to decide whether retrieval is needed for a new turn or prior context can be reused. Experiments on the ProsocialDialog and Safety Reasoning Multi Turn Dialogue benchmarks report an average ~40% relative F1 gain and 8.4% reduction in distributional error versus competitive baselines, while claiming reduced redundant computation.

Significance. If the performance claims hold under rigorous validation, the work could meaningfully advance conversational safety systems by shifting from purely parametric knowledge to explicit normative grounding, improving interpretability. The routing mechanism offers a practical efficiency contribution. However, the significance is limited by the untested assumption that a static RoT corpus supplies sufficiently complete guidance across diverse multi-turn scenarios; without evidence on coverage or failure modes, the gains may not generalize beyond the two evaluated datasets.

major comments (2)
  1. [Experimental Evaluation] The central claims of 40% relative F1 gain and 8.4% distributional error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.
  2. [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open that the system may revert to base LLM behavior in novel situations.
minor comments (2)
  1. [Abstract] The abstract and methods refer to 'competitive baselines' without naming or citing them explicitly; this should be clarified with precise descriptions or references.
  2. [Results] The term 'distributional error' is used in results but lacks a formal definition or equation; adding one would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve the paper's rigor and clarity.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The central claims of 40% relative F1 gain and 8.4% distributional error reduction are reported as summary statistics without statistical significance tests, confidence intervals, per-dataset breakdowns, or error analysis. This undermines assessment of whether the improvements are reliable or driven by specific conversation types.

    Authors: We agree that the current presentation of aggregate results limits interpretability. In the revised manuscript, we will expand the Experimental Evaluation section to include per-dataset breakdowns for ProsocialDialog and Safety Reasoning Multi Turn Dialogue, 95% confidence intervals, statistical significance tests (e.g., paired t-tests or McNemar's test on F1 improvements), and a dedicated error analysis subsection that categorizes cases by conversation type and highlights where RoTRAG provides the largest gains versus baselines. revision: yes

  2. Referee: [RoT Corpus and Retrieval] The performance improvements rest on the assumption that retrieval from the fixed external RoT corpus consistently supplies contextually complete and unbiased normative principles for unseen multi-turn dynamics. No analysis of corpus coverage, retrieval recall rates, or ablation on low-recall cases is provided, leaving open that the system may revert to base LLM behavior in novel situations.

    Authors: We acknowledge this as a valid concern regarding generalization. While the reported gains on both benchmarks indicate that retrieved RoTs provide useful normative grounding in the tested scenarios, we will add retrieval recall statistics and an ablation study on low-recall cases to the RoT Corpus and Retrieval subsection. We will also expand the discussion of limitations to explicitly address potential coverage gaps for novel multi-turn dynamics and note that the routing classifier is designed to fall back to parametric reasoning when retrieval is not triggered. revision: partial
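The significance testing the rebuttal promises (e.g., McNemar's test on paired per-turn decisions) is cheap to run. A self-contained sketch of the exact two-sided version, operating on the discordant pairs where baseline and system disagree:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = baseline correct / system wrong,
    c = baseline wrong / system correct.
    Under the null, b ~ Binomial(b + c, 0.5); a small p-value
    means the two classifiers err on systematically different cases."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical counts, not from the paper: of the turns where the two
# systems disagree, the retrieval-grounded system wins 30 and loses 12.
print(mcnemar_exact(12, 30))
```

With counts this lopsided the p-value falls below 0.05; near-balanced counts (say 20 vs 22) would not, which is exactly the distinction the referee is asking the authors to report.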

Circularity Check

0 steps flagged

No circularity: framework grounded in external RoT corpus and evaluated on public benchmarks

Full rationale

The paper proposes RoTRAG as a retrieval-augmented system that retrieves human-written Rules of Thumb from a fixed external corpus and uses a lightweight routing classifier for efficiency. Performance claims (F1 gains and error reductions) are presented as empirical results on the public ProsocialDialog and Safety Reasoning Multi Turn Dialogue datasets. No equations, derivations, or self-citations in the abstract reduce the claimed improvements to quantities fitted or defined inside the paper itself. The central claims rest on external normative data and standard benchmark evaluation rather than self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the premise that external human-written normative rules can be retrieved and used effectively by LLMs; the binary router is an added component whose training details are not visible.

axioms (1)
  • domain assumption LLMs can produce more consistent and interpretable harm judgments when supplied with retrieved normative rules than when relying solely on parametric knowledge.
    This is the core justification for adding the retrieval step.
invented entities (1)
  • Binary routing classifier (no independent evidence)
    purpose: Decides for each turn whether retrieval of new RoTs is required or prior context suffices.
    Introduced to reduce redundant computation; no independent evidence of its accuracy is provided in the abstract.

pith-pipeline@v0.9.0 · 5526 in / 1422 out tokens · 53706 ms · 2026-05-10T06:52:27.859270+00:00 · methodology

discussion (0)


    When managing relationship dynamics, you should address your personal insecurities. 2.It's wrong to influence others' insecurities through manipulation and third-party involvement. Correspond RoT: RoT 1: It's wrong to play with people's emotions. RoT 2: It's wrong to manipulate your friends for your own benefit RoT 3: Sometimes you have to tell a lie to g...