pith. sign in

arxiv: 2510.05864 · v2 · pith:RRV6D5WYnew · submitted 2025-10-07 · 💻 cs.CL · cs.CY

On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs

classification 💻 cs.CL cs.CY
keywords harmfulsentencesinputinputslongllmssensitivityharm
0
0 comments X
read the original abstract

Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.