Toxicity Detection Should Measure Contextual Harm, Not Text-Intrinsic Badness
Pith reviewed 2026-05-22 23:26 UTC · model grok-4.3
The pith
Toxicity detection should measure contextual communicative harm rather than intrinsic text properties.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Toxicity detection has become core safety infrastructure for online moderation, dataset filtering, and deployed language-model systems. Yet most detectors still treat toxicity as an intrinsic property of isolated text. This position paper argues that toxicity detection should be evaluated as the contextual measurement of situated communicative harm, rather than as single-label text classification. Toxicity is not contained in words alone; it emerges when a communicative act is interpreted by an audience within a normative and social context. We introduce the Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption.
What carries the argument
The Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption and explains the limitations of text-intrinsic detectors.
If this is right
- Text-intrinsic detectors would be recognized as insufficient because they overflag dialectal or reclaimed language.
- Coded or pragmatic abuse that depends on audience interpretation would become detectable.
- Detectors would show less brittleness when text undergoes meaning-preserving transformations.
- Evaluation would separate text risk, norm violation, disruption, uncertainty, and policy action rather than using a single label.
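The last point can be made concrete with a small sketch. The record below is a hypothetical illustration, not a schema from the paper: the field names, the 0-to-1 scales, and the `policy_action` rule with its thresholds are all assumptions, chosen only to show what keeping the components separate might look like in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CSFRecord:
    """Hypothetical CSF-Eval style record; field names and scales are assumptions."""
    text_risk: float       # risk suggested by the surface text alone, in [0, 1]
    norm_violation: float  # perceived norm violation for a given audience, in [0, 1]
    disruption: float      # induced stress or disruption, in [0, 1]
    uncertainty: float     # uncertainty about the scores above, in [0, 1]


def policy_action(record: CSFRecord,
                  harm_threshold: float = 0.5,
                  max_uncertainty: float = 0.3) -> str:
    """Illustrative decision rule (an assumption, not the paper's method)."""
    # Treat contextual harm as requiring both a norm violation and disruption.
    contextual_harm = min(record.norm_violation, record.disruption)
    if record.uncertainty > max_uncertainty:
        return "escalate_to_human"
    if contextual_harm >= harm_threshold:
        return "act"
    return "allow"


# Reclaimed in-group language: high surface risk, low contextual harm.
reclaimed = CSFRecord(text_risk=0.9, norm_violation=0.1, disruption=0.1, uncertainty=0.2)
# Coded abuse: innocuous surface, high contextual harm.
coded = CSFRecord(text_risk=0.1, norm_violation=0.8, disruption=0.7, uncertainty=0.2)

print(policy_action(reclaimed))  # allow
print(policy_action(coded))      # act
```

Note that `text_risk` never enters the decision in this sketch; that is a deliberate simplification to show how a single text-level score and the policy outcome can come apart.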
Where Pith is reading between the lines
- Moderation systems might need new data sources that capture audience demographics or community norms alongside the text.
- The framework could link toxicity detection more closely to concepts from pragmatics and sociolinguistics.
- Training data for language models might shift toward annotations that record contextual stress rather than binary toxicity labels.
- Platform policies could incorporate uncertainty estimates from CSF-Eval when deciding on content removal.
Load-bearing premise
That redefining toxicity as a relation between perceived norm violation and induced stress will produce measurably better detectors and evaluations.
What would settle it
A head-to-head test on context-dependent cases such as reclaimed language or pragmatic abuse where CSF-Eval detectors show no reduction in false positives or missed harms compared with standard text classifiers.
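A minimal sketch of how such a comparison could be scored is given below. Everything in it is hypothetical: the case format, the placeholder texts and context labels, and both toy detectors are assumptions written to exercise the metric. The context-aware detector is rigged to succeed on these four cases, so the numbers say nothing about whether a real CSF-based detector would outperform a text classifier.

```python
from typing import Callable, Sequence, Tuple

# Each case: (text, context, is_harmful_in_context). All values are placeholders.
Case = Tuple[str, str, bool]
Detector = Callable[[str, str], bool]


def error_rates(detector: Detector, cases: Sequence[Case]) -> Tuple[float, float]:
    """Return (false_positive_rate, miss_rate) for a detector on labeled cases."""
    false_pos = sum(1 for t, c, y in cases if not y and detector(t, c))
    misses = sum(1 for t, c, y in cases if y and not detector(t, c))
    negatives = sum(1 for _, _, y in cases if not y)
    positives = sum(1 for _, _, y in cases if y)
    fpr = false_pos / negatives if negatives else 0.0
    miss_rate = misses / positives if positives else 0.0
    return fpr, miss_rate


def text_only(text: str, context: str) -> bool:
    """Toy text-intrinsic detector: ignores context entirely."""
    return "FLAGGED_TERM" in text


def context_aware(text: str, context: str) -> bool:
    """Toy contextual detector: uses the (assumed) context label."""
    return context == "hostile_coded" or ("FLAGGED_TERM" in text and context != "in_group")


cases = [
    ("FLAGGED_TERM used as reclaimed self-reference", "in_group", False),
    ("FLAGGED_TERM aimed at a stranger", "out_group", True),
    ("innocuous-looking coded phrase", "hostile_coded", True),
    ("plain greeting", "neutral", False),
]

print(error_rates(text_only, cases))      # (0.5, 0.5)
print(error_rates(context_aware, cases))  # (0.0, 0.0)
```

Under this scoring, the settling condition described above would correspond to the contextual detector's pair of rates being no lower than the text-only detector's on a realistic set of context-dependent cases.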
Original abstract
Toxicity detection has become core safety infrastructure for online moderation, dataset filtering, and deployed language-model systems. Yet most detectors still treat toxicity as an intrinsic property of isolated text. This position paper argues that toxicity detection should be evaluated as the contextual measurement of situated communicative harm, rather than as single-label text classification. Toxicity is not contained in words alone; it emerges when a communicative act is interpreted by an audience within a normative and social context. We introduce the Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption. CSF explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse, and remain brittle under meaning-preserving transformations. We propose CSF-Eval, an evaluation agenda that separates text risk, norm violation, disruption, uncertainty, and policy action.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a position paper claiming that toxicity detection should be reframed as the contextual measurement of situated communicative harm rather than single-label text classification treating toxicity as text-intrinsic. It introduces the Contextual Stress Framework (CSF) defining toxicity as a relation between perceived norm violation and induced stress or disruption. CSF is asserted to explain limitations of current detectors (overflagging dialectal language, missing coded abuse, brittleness to transformations), and CSF-Eval is proposed to separate text risk, norm violation, disruption, uncertainty, and policy action.
Significance. If operationalized, the reframing could advance the field toward more context-sensitive and equitable detectors by addressing pragmatic and normative factors. The paper correctly flags brittleness under meaning-preserving transformations as a limitation of intrinsic approaches. As a purely conceptual position paper, however, it provides no empirical validation, datasets, or derivations, so any significance remains prospective; no machine-checked proofs, reproducible code, or falsifiable predictions are present.
Major comments (2)
- [Abstract] Abstract: The claim that CSF 'explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse' is load-bearing for the central argument that the framework improves on existing methods, yet it follows only from definitional assertion without any concrete example, case analysis, or derivation showing how the norm-violation/stress relation would change detection outcomes.
- [Abstract] Abstract (CSF-Eval proposal): The separation of text risk, norm violation, disruption, uncertainty, and policy action is central to the proposed evaluation agenda, but the manuscript supplies no indication of how these components would be measured, annotated, or validated in practice, leaving the claim that CSF-Eval constitutes a superior agenda without testable substance.
Minor comments (1)
- [Abstract] The abstract would benefit from brief references to specific common toxicity datasets or models when critiquing 'single-label text classification' to aid reader grounding.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our position paper. We address each major comment below. As the paper is explicitly conceptual, we clarify the scope of our claims while agreeing to strengthen substantiation where feasible through revision.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim that CSF 'explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse' is load-bearing for the central argument that the framework improves on existing methods, yet it follows only from definitional assertion without any concrete example, case analysis, or derivation showing how the norm-violation/stress relation would change detection outcomes.
  Authors: The referee correctly notes that the abstract states the explanatory role of CSF without examples. The full manuscript derives these explanations from the CSF definitions in Section 3 (e.g., how perceived norm violation differs for dialectal language versus standard forms, leading to differential stress induction). To make this more accessible, we will add a new subsection with 2-3 concrete case analyses showing how the norm-violation/stress relation alters detection outcomes compared to intrinsic approaches.
  Revision: yes
- Referee: [Abstract] Abstract (CSF-Eval proposal): The separation of text risk, norm violation, disruption, uncertainty, and policy action is central to the proposed evaluation agenda, but the manuscript supplies no indication of how these components would be measured, annotated, or validated in practice, leaving the claim that CSF-Eval constitutes a superior agenda without testable substance.
  Authors: We agree that the abstract and proposal section present CSF-Eval at a high level without operational details. The manuscript positions CSF-Eval as an agenda (Section 4) rather than an implemented protocol. To address the concern, we will expand the revision with high-level sketches of measurement approaches (e.g., crowdsourced annotation for norm violation using context prompts, stress proxies via user-reported disruption scales) while preserving the position-paper scope; full validation remains future work.
  Revision: yes
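One way the crowdsourced annotation mentioned in the second response might be aggregated is sketched below. This is an assumption-laden illustration, not the authors' protocol: the 1-to-5 rating scale, the `(text, context)` keying, the placeholder context labels, and the use of the population standard deviation as a disagreement proxy are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Hypothetical ratings: annotators score perceived norm violation on a 1-5 scale
# after being shown the same text under different context prompts.
Ratings = Dict[Tuple[str, str], List[int]]


def aggregate(ratings: Ratings) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return (mean rating, annotator disagreement) per (text, context) pair.

    Disagreement is the population standard deviation of the ratings,
    used here as a crude stand-in for uncertainty.
    """
    return {key: (mean(values), pstdev(values)) for key, values in ratings.items()}


ratings: Ratings = {
    ("same text", "in_group"): [1, 1, 2],
    ("same text", "out_group"): [4, 5, 4],
}

for (text, context), (avg, spread) in aggregate(ratings).items():
    print(f"{text!r} in {context}: mean={avg:.2f}, disagreement={spread:.2f}")
```

Keying on the pair rather than on the text alone is the point of the sketch: the same string can receive different scores under different context prompts, which a single text-level label cannot represent.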
Circularity Check
No significant circularity
The paper is a position paper whose central contribution is a normative redefinition of toxicity via the introduced Contextual Stress Framework (CSF). No equations, fitted parameters, derivations, or quantitative predictions appear in the abstract or described structure. The CSF definition (toxicity as relation between perceived norm violation and induced stress) is presented as an explicit definitional framework rather than a result derived from prior inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The argument remains self-contained as conceptual advocacy without reducing any claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Toxicity emerges when a communicative act is interpreted by an audience within a normative and social context rather than being contained in words alone.
invented entities (1)
- Contextual Stress Framework (CSF): no independent evidence