Context-Aware Detection and Victim-Centered Response Generation for Online Harassment in Private Messaging

Candice L Biernesser; Emma Win; Jamie Zelazny; Morgan Rose; Munmun De Choudhury; Nimra Ishfaq; Pinxian Lu; Sierra R Strickland

arxiv: 2512.14700 · v2 · pith:JICCTFWTnew · submitted 2025-11-28 · 💻 cs.SI · cs.CL · cs.CY

Pinxian Lu, Nimra Ishfaq, Emma Win, Morgan Rose, Sierra R Strickland, Candice L Biernesser, Jamie Zelazny, Munmun De Choudhury

Pith reviewed 2026-05-21 18:32 UTC · model grok-4.3

keywords online harassment · private messaging · large language models · context-aware detection · victim-centered responses · adolescent mental health · Instagram direct messages · de-escalation

The pith

AI-generated responses to private online harassment are rated significantly more helpful than original participant replies by human evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how large language models can detect harassment unfolding across multiple turns in private messaging and then generate supportive replies for the person being targeted. It draws on a donated set of 80,053 Instagram direct messages from adolescents aged 12 to 18, some with suicide risk factors, to build both a detection system and a response system. The detection pipeline accounts for full conversation context rather than isolated messages, and the response system produces replies focused on emotional support and de-escalation. Human evaluators judged the AI replies more helpful than what the original message recipients had written, with clear advantages in emotional support. A sympathetic reader would care because private harassment often lacks immediate outside help and can escalate without timely intervention.

Core claim

Using a human-labeled dataset of 80,053 private Instagram direct messages from 26 adolescents, the authors show that a context-aware cascading LLM classification pipeline outperforms baseline toxicity classifiers trained on public social media data. They further show that a victim-centered response framework generates AI replies that human evaluators rate as significantly more helpful than the original participant responses, with a 95% CI of 0.767--0.815 and p less than .001, especially for emotional support and de-escalation.

What carries the argument

A context-aware cascading LLM classification pipeline that processes multi-turn private conversations to identify harassment, combined with a victim-centered response generation framework that produces psychologically grounded replies.

If this is right

Context-aware detection can identify harassment patterns in private chats that standard public-data toxicity models miss.
Victim-centered AI replies can supply immediate emotional support and de-escalation during ongoing harassment episodes.
Messaging platforms could embed such systems to deliver just-in-time assistance to adolescents experiencing private harassment.
The approach demonstrates the feasibility of training on donated private conversation data rather than public posts alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such systems might extend support to victims who lack quick access to human counselors or peers during harassment.
The detection and response methods could be tested on other private messaging platforms to check whether performance holds across different user groups.
Real-world deployment would require safeguards for consent, privacy, and accuracy to avoid unintended escalation or mislabeling.

Load-bearing premise

The human-labeled dataset of 80,053 private messages accurately captures context-dependent harassment without substantial labeling bias or loss of conversational history that would affect the cascading classifier and response evaluation.

What would settle it

A new study that applies the same pipeline and response framework to a fresh set of private messages and finds that human evaluators no longer rate the AI replies as significantly more helpful than originals on emotional support or de-escalation measures would challenge the central result.

Figures

Figures reproduced from arXiv: 2512.14700 by Candice L Biernesser, Emma Win, Jamie Zelazny, Morgan Rose, Munmun De Choudhury, Nimra Ishfaq, Pinxian Lu, Sierra R Strickland.

**Figure 1.** LLM classification pipeline structure. Excerpt from the accompanying text: "… window of large language models and construct the corresponding conversation contexts when labeling each message. We build our classification tool as an LLM pipeline that connects two LLM agents, using Python and vLLM. The model we use is meta-llama/Llama-4-Scout-17B-16E-Instruct (meta llama 2025). In the classification pipeline, the two agents are given the message …"
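The Figure 1 excerpt says a conversation context is constructed for each message being labeled, and the extracted prompts further down mark the target with "(label this message)". A minimal sketch of that idea follows; the window size, the `sender: text` line format, and the function name `build_context` are assumptions made for illustration, not details confirmed by the paper.

```python
from typing import List, Tuple

# Marker text taken from the extracted agent prompts; everything else is hypothetical.
MARKER = "(label this message)"


def build_context(
    messages: List[Tuple[str, str]], index: int, window: int = 5
) -> str:
    """Render the message at `index` with up to `window` preceding messages.

    `messages` is a list of (sender, text) pairs in chronological order.
    The target message is placed last and tagged with MARKER so a
    classifier can be told to label only that line.
    """
    if not 0 <= index < len(messages):
        raise IndexError("index is outside the conversation")
    if window < 0:
        raise ValueError("window must be non-negative")

    start = max(0, index - window)
    # Preceding turns supply context but are not themselves labeled.
    lines = [f"{sender}: {text}" for sender, text in messages[start:index]]
    sender, text = messages[index]
    lines.append(f"{sender}: {text} {MARKER}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented toy conversation, not data from the study.
    chat = [("A", "hi"), ("B", "hey"), ("A", "you are useless")]
    print(build_context(chat, index=2, window=1))
```

Sliding this function over every index yields one prompt per message, so each label is decided with its preceding turns visible while the instruction to classify only the last line stays unambiguous.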

**Figure 2.** Simulated response pipeline structure. Excerpt from the accompanying text: "… is 0. If the classification of the first agent is 1, then the classification of the second agent is the final classification. This method is shown to reduce false positive cases. After all labels are collected from the two agents in a pipeline, the final labels are summarized through cascading classification and compared with the ground truth labels to generate a cl…"
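The cascading rule quoted in this excerpt (a first-agent 0 is final; otherwise the second agent decides) can be sketched as below. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, labels are assumed to be binary 0/1, and the precision/recall helper stands in for whatever classification report the paper produces.

```python
from typing import List, Sequence, Tuple


def cascade(first: Sequence[int], second: Sequence[int]) -> List[int]:
    """Combine per-message labels from two agents with the cascading rule.

    If the first agent says 0 the final label is 0; if it says 1 the
    second agent's label is final. For binary labels this is a logical
    AND, which is why it can only remove positives, never add them.
    """
    if len(first) != len(second):
        raise ValueError("both agents must label the same messages")
    return [0 if a == 0 else b for a, b in zip(first, second)]


def precision_recall(
    predicted: Sequence[int], truth: Sequence[int]
) -> Tuple[float, float]:
    """Precision and recall for the positive (harassment) class."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented labels for four messages, purely for illustration.
    agent1 = [0, 1, 1, 0]
    agent2 = [1, 1, 0, 0]
    final = cascade(agent1, agent2)
    print(final, precision_recall(final, truth=[0, 1, 1, 0]))
```

Because the second agent can only confirm or veto a first-stage positive, the design trades some recall for fewer false positives, which matches the excerpt's stated motivation.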

Original abstract

Online harassment is a widespread social and public health concern, yet most computational approaches for detecting and addressing harassment focus on publicly visible social media content rather than private messaging environments. Private conversations present unique challenges because harmful interactions often unfold through context-dependent, multi-turn exchanges, while victims may lack timely support during moments of harassment. In this study, we investigate how large language models (LLMs) can support both the detection of and response to online harassment in private messaging. Using a dataset of 80,053 Instagram direct messages donated by 26 adolescents aged 12-18, including youth with suicide risk factors, we first construct a human-labeled dataset of online harassment in private conversations and develop a context-aware cascading LLM classification pipeline. The proposed pipeline outperforms baseline toxicity classifiers trained primarily on public social media data. We then develop a victim-centered response framework that produces context-sensitive and psychologically-grounded AI-generated responses to online harassment messages. Human evaluators perceived the AI-generated responses as significantly more helpful than the original participant responses (95% CI: 0.767--0.815, p < .001), particularly in terms of emotional support and de-escalation. Our findings highlight the potential of context-aware and victim-centered AI systems to provide just-in-time support during harassment in private messaging environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies a cascading context-aware LLM to harassment in private teen DMs and reports AI responses rated more helpful than originals, but the human evaluation lacks reported blinding and full context details that could affect the main claim.

The main thing to know is that this work takes toxicity detection into private adolescent messaging with a cascading LLM that uses conversation history, then generates victim-centered responses that human raters scored as more helpful than the actual participant replies, backed by a 95% CI of 0.767-0.815 and p less than .001. They also built and labeled a dataset of 80k Instagram DMs from 26 teens aged 12-18, some with suicide risk factors. That dataset and the outperformance over public-data baselines are the concrete advances here. The victim-centered framing, which emphasizes emotional support and de-escalation, is a reasonable shift from pure detection work. The real-data collection and the reported statistical result on helpfulness give the paper some grounding that pure simulation studies lack.

The soft spot sits in the response evaluation. Without details on whether raters were blinded to AI versus human origin or received the complete multi-turn history when scoring, the helpfulness difference could partly reflect halo effects around polished AI text or incomplete understanding of the harassment buildup. The abstract and stress-test note both leave this unclear, so the p-value result is harder to interpret than it first appears. Labeling the original messages for context-dependent harassment could carry similar risks if labelers missed subtle history or brought their own biases.

This is for researchers working on computational tools for youth online safety or just-in-time mental health support. A reader focused on applied LLM pipelines in private data settings would get value from the dataset scale and the pipeline description. It deserves a serious referee because the domain gap is documented and the empirical piece is present, even if the evaluation protocol needs tightening. I would send it to peer review with targeted questions on blinding and context provision in the rating task.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a context-aware cascading LLM classification pipeline for detecting online harassment in a dataset of 80,053 private Instagram direct messages donated by 26 adolescents (including those with suicide risk factors). It further proposes a victim-centered response generation framework that produces context-sensitive AI replies, claiming the detection pipeline outperforms public-social-media toxicity baselines and that human evaluators rate the AI-generated responses as significantly more helpful than the original participant responses (95% CI: 0.767--0.815, p < .001), especially for emotional support and de-escalation.

Significance. If the evaluation results hold under rigorous controls, the work could meaningfully advance just-in-time support tools for private-messaging harassment, a setting where context-dependent multi-turn dynamics are common and public-data classifiers often fail. The use of a real adolescent donor dataset adds ecological validity, but the overall contribution depends on whether the human-rating comparison can be shown to be free of the methodological confounds noted below.

major comments (1)

[Human evaluation / results section (referenced in abstract)] The central helpfulness claim (95% CI: 0.767--0.815, p < .001) rests on human evaluators comparing AI-generated versus original responses for emotional support and de-escalation. The manuscript provides no information on whether evaluators were blinded to response origin or supplied with the complete multi-turn conversational history (including prior harassment context). Absent these controls, ratings may reflect halo effects around AI fluency or incomplete understanding of context-dependent dynamics rather than genuine superiority, directly weakening the reported statistical result.
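For readers who want to see what an interval of this kind would involve, the sketch below computes a Wilson score interval for a proportion. It rests on a loud assumption: this page does not say which statistic the reported 95% CI describes, so treating it as a proportion (for example, the share of ratings favoring the AI reply) is a guess, and the counts in the usage line are invented, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `z` = 1.96 corresponds to an approximate 95% confidence level.
    Whether the paper's interval was computed this way is unknown.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials**2))
        / denom
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the function.
    low, high = wilson_interval(successes=50, trials=100)
    print(f"{low:.4f} to {high:.4f}")
```

A narrow interval from such a calculation reflects only sampling variability. It says nothing about the blinding or context-provision confounds raised in this comment, which is why the protocol details matter independently of the statistics.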

minor comments (2)

[Abstract] The abstract states that a human-labeled dataset was constructed but supplies no details on labeling protocol, number of annotators, inter-rater reliability, or how conversational history was presented to labelers; these omissions make it difficult to assess potential labeling bias.
[Methods / pipeline architecture] The description of the cascading classifier would benefit from an explicit diagram or pseudocode showing how context from prior turns is encoded and passed between stages.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for highlighting an important methodological detail in the human evaluation. We address the comment below and have revised the manuscript accordingly.

Referee: [Human evaluation / results section (referenced in abstract)] The central helpfulness claim (95% CI: 0.767--0.815, p < .001) rests on human evaluators comparing AI-generated versus original responses for emotional support and de-escalation. The manuscript provides no information on whether evaluators were blinded to response origin or supplied with the complete multi-turn conversational history (including prior harassment context). Absent these controls, ratings may reflect halo effects around AI fluency or incomplete understanding of context-dependent dynamics rather than genuine superiority, directly weakening the reported statistical result.

Authors: We agree that the original manuscript did not sufficiently describe the human evaluation protocol, which is a valid concern. In the revised version we have added a dedicated subsection under Methods that fully specifies the evaluation procedure. Evaluators received the complete multi-turn conversation history (all prior messages up to and including the harassment turn) for every rating task. Responses were presented in randomized order without any labels indicating origin (AI-generated or original participant reply), and evaluators were explicitly instructed that they would see both types of replies. We have also added the exact rating instructions, the number of evaluators, and inter-rater agreement statistics. These controls were in place during data collection; the reported confidence interval and p-value therefore reflect ratings obtained under blinded, context-rich conditions. We believe the revisions eliminate the possibility of halo effects or incomplete context understanding.

Revision: yes
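The controls described in this simulated rebuttal (randomized order, no origin labels, an inter-rater agreement statistic) can be sketched as follows. Everything here is illustrative: the neutral labels "Response A"/"Response B", the function names, and the choice of Cohen's kappa for two raters are assumptions, since the page does not name the agreement statistic or the number of evaluators.

```python
import random
from typing import Dict, Hashable, Sequence, Tuple


def blind_pair(
    ai_reply: str, original_reply: str, rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle two replies under neutral labels.

    Returns (shown, key): `shown` is what a rater sees, `key` maps each
    neutral label back to its origin and stays hidden until analysis.
    """
    items = [("ai", ai_reply), ("original", original_reply)]
    rng.shuffle(items)
    labels = ["Response A", "Response B"]
    shown = {label: text for label, (_, text) in zip(labels, items)}
    key = {label: origin for label, (origin, _) in zip(labels, items)}
    return shown, key


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    shown, key = blind_pair("AI text", "participant text", random.Random(0))
    print(shown, key)
    # Invented ratings, used only to demonstrate the calculation.
    print(cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0]))
```

Keeping `key` separate from `shown` is the point of the design: raters never see origin information, and the mapping is joined back only when scores are analyzed.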

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external human labels and independent evaluator ratings

The paper's core results derive from a human-labeled dataset of 80,053 donated Instagram messages and subsequent ratings by human evaluators comparing AI-generated responses to original participant responses. The reported superiority (95% CI 0.767-0.815, p < .001) is obtained directly from these external annotations and ratings rather than from any model parameter fitted to the target metric, any self-referential definition, or a self-citation chain. No equations, uniqueness theorems, or ansatzes are presented that would reduce the evaluation outcome to the inputs by construction. The classification pipeline and response framework are evaluated against independent human judgments, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work depends on standard assumptions about the validity of human annotations for harassment and the representativeness of the small donor sample; no new free parameters, axioms beyond domain conventions, or invented entities are introduced.

axioms (1)

domain assumption: Human labels on private multi-turn messages accurately identify context-dependent harassment.
The entire classification pipeline and response evaluation rest on these labels being reliable ground truth.

pith-pipeline@v0.9.0 · 5791 in / 1205 out tokens · 65313 ms · 2026-05-21T18:32:37.538876+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: cascading classification method... if the classification of the first agent is 0, then the final classification is 0. If ... 1, then the classification of the second agent is the final classification

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Human evaluators perceived the AI-generated responses as significantly more helpful... (95% CI: 0.767--0.815, p < .001)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1] Tech Abuse
The psychosocial impacts of cybervictimisation and barriers to seeking social support: Young people’s perspectives. Children and youth services review, 111: 104872. Di Capua, M.; Di Nardo, E.; and Petrosino, A. 2016. Unsupervised cyber bullying detection in social networks. In 2016 23rd International conference on pattern recognition (ICPR), 432–437. I...

[2] meta llama
The effects of cyberbullying victimization on depression and suicidal ideation among adolescents and young adults: a three year cohort study from India. BMC psychiatry, 22(1): 599. meta llama. 2025. meta-llama/Llama-4-Scout-17B-16E-Instruct · Hugging Face. Munger, K. 2021. Don’t@ Me: Experimentally reducing partisan incivility on Twitter. Journal of Ex...

[3] Reusser, A
Tackling Cyberbullying: Review of Empirical Evidence Regarding Successful Responses by Students, Parents, and Schools. International Journal of Conflict and Violence, 6: 283–292. Reusser, A. I. Y.; Veit, K. M.; Gassin, E. A.; Case, J. P.; and Reusser, G. M. 2021. Assessing the Prevalence of Benevolence in Response to Online Toxicity on Reddit: A F...

[4] A Web of Hate: Tackling Hateful Speech in Online Social Spaces
A web of hate: Tackling hateful speech in online social spaces. arXiv preprint arXiv:1709.10159. Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Scheuerman, M. K.; Jiang, J. A.; Fiesler, C.; and Brubaker, J. R. 2021. A framework of severi...

[5]
Understanding abuse: A typology of abusive language detection subtasks. arXiv preprint arXiv:1705.09899. X. 2025. About online abuse. Yin, D.; Xue, Z.; Hong, L.; Davison, B. D.; Kontostathis, A.; Edwards, L.; et al. 2009. Detection of harassment on web 2.0. Proceedings of the Content Analysis in the WEB, 2(0): 1–7. Table 7: Instagram message labeler informa...

[6] If the last message looks like online harassment, you must reply 1 and provide reasoning

[7] If the last message doesn’t look like online harassment, you must reply 0 and provide reasoning

[8] If another message in the conversation looks like online harassment, while the last message doesn not, you must reply 0 and provide reasoning

[11] Here are some suggestions to accurately identify online harassment:
Other messages in the conversation are labeled by other labels. Here are some suggestions to accurately identify online harassment:

[13] Try not to stretch the meaning of a message

[15] Here are an additional guidelines:
If the message has unintelligible words or phrases, it may have a typo. Here are an additional guidelines:

[16] Table 9: Classification pipeline Agent 1 user prompt. You are given an online conversation
In the message, if there are apparently harmful words targeting someone, then it’s definitely online harassment. Table 9: Classification pipeline Agent 1 user prompt. You are given an online conversation. You only classify whether the last message is online harassment. The last message is marked with "(label this...

[17] If you are confident that the last message is online harassment, you must reply 1 and provide reasoning

[18] If the last message is not online harassment, you must reply 0 and provide reasoning

[19] If another message in the conversation is online harassment, while the last message is not, you must reply 0 and provide reasoning

[20] Give 1 label if the online harassment is targeted at someone

[21] Meanwhile, you only classify whether the last message is online harassment targeting someone
Take the other messages in the conversation into account when classifying the last message. Meanwhile, you only classify whether the last message is online harassment targeting someone

[22] Other messages in the conversation are labeled by other labels

[23] Here are some suggestions to accurately identify online harassment:
You provide label 1 when you are at least kind of sure. Here are some suggestions to accurately identify online harassment:

[24] Passive-aggressive messages are not online harassment

[25] Never stretch the meaning of a message

[26] They can almost never be online harassment
Emojis don’t carry enough meaning. They can almost never be online harassment

[27] Here are some additional guidelines:
If the message has unintelligible words or phrases, it may have a typo, not online harassment. Here are some additional guidelines:

[28] In the message, if there are apparently harmful words targeting someone, then it’s definitely online harassment

[29] Do not overthink the tone of the message

[30] Do not overthink how one message implies to be sarcastic

[31] Do not overthink how one message implies to be manipulative

[32] You must never use the word "imply" in your reasoning

[33] Reasoning:
Generally speaking, online harassment is rare among ordinary conversations. Table 11: Classification pipeline Agent 2 user prompt. You are given an online conversation. You only classify whether the last message is online harassment. The last message is marked with "(label this message)". The definition of online...