Toxicity Detection Should Measure Contextual Harm, Not Text-Intrinsic Badness

Noel Crespi; Reza Farahbakhsh; Sergei Berezin

arXiv: 2503.16072 · v4 · submitted 2025-03-20 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-22 23:26 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords toxicity detection, contextual harm, communicative harm, norm violation, content moderation, AI safety, language models

The pith

Toxicity detection should measure contextual communicative harm rather than intrinsic text properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current toxicity detectors classify isolated text as good or bad, but this paper claims toxicity only arises when a message is interpreted by an audience inside a specific social and normative setting. It introduces the Contextual Stress Framework to define toxicity as the relation between a perceived norm violation and the stress or disruption that follows. This account explains why text-only systems overflag reclaimed language or dialect and miss coded abuse that depends on context. The paper proposes CSF-Eval to break evaluation into separate components of risk, violation, disruption, uncertainty, and policy action. Adopting the view would change how safety systems for platforms and language models assess harm.

Core claim

What carries the argument

The Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption and explains the limitations of text-intrinsic detectors.

If this is right

Text-intrinsic detectors would be recognized as insufficient because they overflag dialectal or reclaimed language.
Coded or pragmatic abuse that depends on audience interpretation would become detectable.
Detectors would show less brittleness when text undergoes meaning-preserving transformations.
Evaluation would separate text risk, norm violation, disruption, uncertainty, and policy action rather than using a single label.
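
The last point can be made concrete with a small record type. This is only a sketch: the paper names the five components, but nothing summarized here specifies scales or field names, so `CSFEvalRecord`, its fields, and the [0, 1] ranges are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CSFEvalRecord:
    """One evaluated message, with the five CSF-Eval components kept apart.

    Field names and the [0, 1] scales are hypothetical; the paper only
    names the components.
    """
    text_risk: float        # risk judged from the text alone
    norm_violation: float   # perceived violation of the audience's norms
    disruption: float       # stress or disruption induced in the audience
    uncertainty: float      # how unsure the overall assessment is
    policy_action: str      # e.g. "allow", "review", "remove"

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] scale.
        for name in ("text_risk", "norm_violation", "disruption", "uncertainty"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


# Same words, two contexts: text risk is identical while the contextual
# components differ. The numbers are made up for illustration.
in_group = CSFEvalRecord(0.8, 0.1, 0.1, 0.2, "allow")
hostile = CSFEvalRecord(0.8, 0.9, 0.8, 0.2, "remove")
```

Keeping the fields separate, rather than collapsing them into one label, is what lets two records share a text risk yet lead to different policy actions.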

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Moderation systems might need new data sources that capture audience demographics or community norms alongside the text.
The framework could link toxicity detection more closely to concepts from pragmatics and sociolinguistics.
Training data for language models might shift toward annotations that record contextual stress rather than binary toxicity labels.
Platform policies could incorporate uncertainty estimates from CSF-Eval when deciding on content removal.
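
One way to picture the last extension is a decision rule that consults the uncertainty estimate before acting. The function below is a hypothetical sketch; `decide_action`, its thresholds, and the action names are assumptions and do not come from the paper.

```python
def decide_action(harm: float, uncertainty: float,
                  remove_at: float = 0.7,
                  max_uncertainty: float = 0.3) -> str:
    """Map a harm estimate and its uncertainty to a moderation action.

    Both inputs are assumed to lie in [0, 1]. The thresholds are
    placeholder values, not values taken from the paper.
    """
    if not (0.0 <= harm <= 1.0 and 0.0 <= uncertainty <= 1.0):
        raise ValueError("harm and uncertainty must be in [0, 1]")
    if harm < remove_at:
        return "allow"
    # High estimated harm: act automatically only when the estimate is confident.
    if uncertainty > max_uncertainty:
        return "human_review"
    return "remove"
```

Under these assumed thresholds, a high-harm estimate with high uncertainty is routed to a person rather than removed automatically.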

Load-bearing premise

That redefining toxicity as a relation between perceived norm violation and induced stress will produce measurably better detectors and evaluations.

What would settle it

A head-to-head test on context-dependent cases, such as reclaimed language or pragmatic abuse, in which detectors built on CSF show no reduction in false positives or missed harms compared with standard text classifiers.
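
The comparison described here reduces to computing two error rates per detector on the same labeled cases. The sketch below assumes boolean labels and predictions; the function names and the toy data are hypothetical placeholders, not results from the paper.

```python
def error_rates(predictions, labels):
    """Return (false_positive_rate, false_negative_rate).

    predictions and labels are equal-length sequences of booleans, where
    True means "harmful in this context".
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    false_pos = sum(1 for p, y in zip(predictions, labels) if p and not y)
    false_neg = sum(1 for p, y in zip(predictions, labels) if y and not p)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


def shows_no_reduction(baseline, candidate, labels):
    """True if the candidate lowers neither error rate relative to baseline."""
    base_fpr, base_fnr = error_rates(baseline, labels)
    cand_fpr, cand_fnr = error_rates(candidate, labels)
    return cand_fpr >= base_fpr and cand_fnr >= base_fnr


# Placeholder data only: four context-dependent cases with made-up outputs.
labels = [False, False, True, True]       # contextual-harm judgments
text_only = [True, False, False, True]    # hypothetical baseline outputs
contextual = [False, False, True, True]   # hypothetical candidate outputs
```

A result of `True` from `shows_no_reduction` on a real benchmark of such cases would be the outcome that counts against the framework.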

Figures

Figures reproduced from arXiv: 2503.16072 by Noel Crespi, Reza Farahbakhsh, Sergei Berezin.

Original abstract

Toxicity detection has become core safety infrastructure for online moderation, dataset filtering, and deployed language-model systems. Yet most detectors still treat toxicity as an intrinsic property of isolated text. This position paper argues that toxicity detection should be evaluated as the contextual measurement of situated communicative harm, rather than as single-label text classification. Toxicity is not contained in words alone; it emerges when a communicative act is interpreted by an audience within a normative and social context. We introduce the Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption. CSF explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse, and remain brittle under meaning-preserving transformations. We propose CSF-Eval, an evaluation agenda that separates text risk, norm violation, disruption, uncertainty, and policy action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a position paper that restates the case for context in toxicity detection but adds no data, experiments, or measurable improvement over existing arguments.

The paper argues that toxicity should be treated as contextual harm rather than a property of the text alone, and it introduces the Contextual Stress Framework (CSF) plus a CSF-Eval agenda to separate norm violation, disruption, and policy action. The argument is internally consistent and correctly notes that text-only classifiers overflag dialect or reclaimed language and miss pragmatic cases. That part is clear and draws on familiar points from pragmatics and sociolinguistics without contradiction. What is new is mainly the named framework and the proposed evaluation breakdown; the underlying observation is not. The paper does a decent job of spelling out the practical problems with current detectors in moderation and dataset work.

The soft spot is the absence of any test, derivation, or even illustrative example showing that CSF leads to better detectors or fewer errors. The claim that this approach will reorganize safety infrastructure rests on the untested premise that redefining the problem this way will produce measurable gains. No code, data, or formal check is provided, so the advantage stays at the level of assertion.

This paper is for researchers already working on AI safety evaluation who want a structured way to discuss context; it will not give practitioners a new method or benchmark they can use tomorrow. A reader looking for empirical grounding will come away empty. It deserves peer review because the topic is central to deployed systems and the framing could sharpen ongoing debates, even though the paper itself is conceptual and would need substantial follow-up work to move beyond position-taking.

Referee Report

2 major / 1 minor

Summary. The paper is a position paper claiming that toxicity detection should be reframed as the contextual measurement of situated communicative harm rather than single-label text classification treating toxicity as text-intrinsic. It introduces the Contextual Stress Framework (CSF) defining toxicity as a relation between perceived norm violation and induced stress or disruption. CSF is asserted to explain limitations of current detectors (overflagging dialectal language, missing coded abuse, brittleness to transformations), and CSF-Eval is proposed to separate text risk, norm violation, disruption, uncertainty, and policy action.

Significance. If operationalized, the reframing could advance the field toward more context-sensitive and equitable detectors by addressing pragmatic and normative factors. The paper correctly flags brittleness under meaning-preserving transformations as a limitation of intrinsic approaches. As a purely conceptual position paper, however, it provides no empirical validation, datasets, or derivations, so any significance remains prospective; no machine-checked proofs, reproducible code, or falsifiable predictions are present.

major comments (2)

[Abstract] The claim that CSF 'explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse' is load-bearing for the central argument that the framework improves on existing methods, yet it follows only from definitional assertion without any concrete example, case analysis, or derivation showing how the norm-violation/stress relation would change detection outcomes.
[Abstract, CSF-Eval proposal] The separation of text risk, norm violation, disruption, uncertainty, and policy action is central to the proposed evaluation agenda, but the manuscript supplies no indication of how these components would be measured, annotated, or validated in practice, leaving the claim that CSF-Eval constitutes a superior agenda without testable substance.

minor comments (1)

[Abstract] The abstract would benefit from brief references to specific common toxicity datasets or models when critiquing 'single-label text classification' to aid reader grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our position paper. We address each major comment below. As the paper is explicitly conceptual, we clarify the scope of our claims while agreeing to strengthen substantiation where feasible through revision.

Referee: [Abstract] The claim that CSF 'explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse' is load-bearing for the central argument that the framework improves on existing methods, yet it follows only from definitional assertion without any concrete example, case analysis, or derivation showing how the norm-violation/stress relation would change detection outcomes.

Authors: The referee correctly notes that the abstract states the explanatory role of CSF without examples. The full manuscript derives these explanations from the CSF definitions in Section 3 (e.g., how perceived norm violation differs for dialectal language versus standard forms, leading to differential stress induction). However, to make this more accessible, we will add a new subsection with 2-3 concrete case analyses showing how the norm-violation/stress relation alters detection outcomes compared to intrinsic approaches.
revision: yes

Referee: [Abstract, CSF-Eval proposal] The separation of text risk, norm violation, disruption, uncertainty, and policy action is central to the proposed evaluation agenda, but the manuscript supplies no indication of how these components would be measured, annotated, or validated in practice, leaving the claim that CSF-Eval constitutes a superior agenda without testable substance.

Authors: We agree that the abstract and proposal section present CSF-Eval at a high level without operational details. The manuscript positions CSF-Eval as an agenda (Section 4) rather than an implemented protocol. To address the concern, we will expand the revision with high-level sketches of measurement approaches (e.g., crowdsourced annotation for norm violation using context prompts, stress proxies via user-reported disruption scales) while preserving the position-paper scope; full validation remains future work.
revision: yes
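
The measurement approaches mentioned in this response could be aggregated along the following lines. This is a hypothetical sketch, not the authors' protocol: the annotation format (a yes/no norm-violation vote plus a 1-to-5 disruption rating per annotator), the function name `aggregate_annotations`, and the use of vote disagreement as an uncertainty proxy are all assumptions.

```python
from statistics import mean


def aggregate_annotations(annotations, scale_max=5):
    """Aggregate per-annotator judgments for one message in one context.

    annotations: list of (violates_norm, disruption) pairs, where
    violates_norm is a bool and disruption is an int in 1..scale_max.
    The format and the scale are assumed for illustration.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    for _, rating in annotations:
        if not 1 <= rating <= scale_max:
            raise ValueError(f"disruption rating {rating} outside 1..{scale_max}")
    violation_rate = mean(1.0 if vote else 0.0 for vote, _ in annotations)
    # Rescale the 1..scale_max ratings to [0, 1] before averaging.
    disruption = mean((rating - 1) / (scale_max - 1) for _, rating in annotations)
    # Disagreement proxy: 0.0 when votes are unanimous, 1.0 at an even split.
    uncertainty = 1.0 - abs(2.0 * violation_rate - 1.0)
    return {
        "norm_violation": violation_rate,
        "disruption": disruption,
        "uncertainty": uncertainty,
    }
```

Running the same message through this function once per context, with a separate annotator pool for each community, would yield one set of scores per context rather than a single text-level label.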

Circularity Check

0 steps flagged

No significant circularity

The paper is a position paper whose central contribution is a normative redefinition of toxicity via the introduced Contextual Stress Framework (CSF). No equations, fitted parameters, derivations, or quantitative predictions appear in the abstract or described structure. The CSF definition (toxicity as relation between perceived norm violation and induced stress) is presented as an explicit definitional framework rather than a result derived from prior inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The argument remains self-contained as conceptual advocacy without reducing any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that toxicity is definitionally a contextual relation rather than an intrinsic property; no free parameters or invented physical entities appear.

axioms (1)

Domain assumption: Toxicity emerges when a communicative act is interpreted by an audience within a normative and social context rather than being contained in words alone.
Stated directly in the abstract as the foundational premise for rejecting text-intrinsic detection.

invented entities (1)

Contextual Stress Framework (CSF) · no independent evidence
Purpose: To define toxicity as the relation between perceived norm violation and induced stress or disruption.
Newly introduced in the abstract to organize the argument; no independent evidence or falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5679 in / 1301 out tokens · 47780 ms · 2026-05-22T23:26:54.161289+00:00 · methodology

