From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3
The pith
Paired analysis shows 61% of LLM responses reduce harm from the input prompt while 3% escalate it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that safety evaluations must move from isolated prompt or response classification to paired prompt-response records. When this is done with ordinal severity labels, 61% of responses reduce harm, 36% preserve severity, and 3% escalate. Escalation occurs either through unrequested harmful detail on benign prompts or through on-task answers at a higher severity than the prompt. Sexual content shows the highest harm persistence in this sample, driven by compliance at the same severity.
What carries the argument
Paired ordinal severity labeling (Safe, Low, Medium, High) of both prompt and response, which tracks harm change rather than binary outcomes.
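A minimal sketch of how paired ordinal labels can be turned into reduce/preserve/escalate shares. The integer encoding of the four levels, the function names, and the toy records are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter

# Assumed integer encoding of the ordinal scale named in the paper.
SEVERITY = {"Safe": 0, "Low": 1, "Medium": 2, "High": 3}

def transition(prompt_label: str, response_label: str) -> str:
    """Classify one prompt-response pair by the sign of the severity change."""
    delta = SEVERITY[response_label] - SEVERITY[prompt_label]
    if delta < 0:
        return "reduce"
    if delta > 0:
        return "escalate"
    return "preserve"

def transition_shares(pairs):
    """Return the share of pairs that fall in each transition class."""
    counts = Counter(transition(p, r) for p, r in pairs)
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("reduce", "preserve", "escalate")}

# Hypothetical toy records (prompt label, response label); not the paper's data.
toy_pairs = [
    ("High", "Safe"),
    ("Medium", "Low"),
    ("Low", "Low"),
    ("Safe", "Low"),
]
shares = transition_shares(toy_pairs)
print(shares)
```

Under this encoding a Safe prompt with a Safe response counts as "preserve". Whether the paper treats such pairs the same way is not stated in the text reviewed here.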
If this is right
- Most LLM responses (61%) are lower in severity than the prompt that produced them.
- Escalations arise from either adding unrequested harmful content or increasing severity while remaining on task.
- Sexual content shows the highest harm persistence, driven by compliance at the same severity level.
- Safe refusals tend to have low relevance, exposing a helpfulness-harmlessness tradeoff.
- Few-shot LLM graders detect risk more readily in prompts than in responses even after calibration.
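The two escalation mechanisms and the per-category persistence measure in the list above can be sketched as follows. This is one possible operationalization, not the paper's: it assumes a Safe prompt stands in for "benign", it does not model the relevance judgment needed to confirm that an answer is on task, and all names and records are hypothetical.

```python
from collections import defaultdict

# Assumed integer encoding of the ordinal scale named in the paper.
SEVERITY = {"Safe": 0, "Low": 1, "Medium": 2, "High": 3}

def escalation_mechanism(prompt_label, response_label):
    """Return the escalation type for a pair, or None if it does not escalate."""
    p, r = SEVERITY[prompt_label], SEVERITY[response_label]
    if r <= p:
        return None
    # Assumption: a Safe prompt is treated as benign; any other escalating
    # pair is treated as an on-task answer at higher severity.
    return "unrequested_detail" if p == 0 else "on_task_higher_severity"

def same_severity_persistence(records):
    """Per category: share of non-Safe prompts answered at the same severity."""
    kept, total = defaultdict(int), defaultdict(int)
    for category, prompt_label, response_label in records:
        if SEVERITY[prompt_label] == 0:
            continue  # Safe prompts carry no harm that could persist
        total[category] += 1
        kept[category] += prompt_label == response_label
    return {c: kept[c] / total[c] for c in total}

# Hypothetical toy records (category, prompt label, response label).
toy_records = [
    ("Sexual", "Medium", "Medium"),
    ("Sexual", "Low", "Safe"),
    ("Hate", "High", "Safe"),
    ("Violence", "Safe", "Low"),
    ("Violence", "Low", "Medium"),
]
print(same_severity_persistence(toy_records))
```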
Where Pith is reading between the lines
- The paired method could be reused on newer models or additional categories to measure whether safety training reduces escalation rates over time.
- Alignment efforts might target the two distinct escalation mechanisms separately rather than treating all harm uniformly.
- Better refusal templates could aim to preserve relevance while still lowering severity, addressing the observed tradeoff.
Load-bearing premise
Human annotations assign accurate and consistent ordinal severity levels to prompts and responses across all categories.
What would settle it
Independent re-labeling of the same prompt-response pairs that produces materially different percentages for reduction, preservation, and escalation.
Original abstract
Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a paired analysis of human-labeled LLM prompts and responses across four harm categories (Sexual, Self-harm, Hate, Violence) using ordinal severity levels (Safe, Low, Medium, High). It reports that 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate, with escalation split into two mechanisms (benign prompts triggering unrequested harmful detail; on-task answers at higher severity). Sexual content shows the highest harm persistence via compliance at the same severity. The work also examines the relevance tradeoff between helpfulness and harmlessness and a prompt/response detection asymmetry in few-shot LLM graders, with grader prompts shared on GitHub.
Significance. If the human annotations are reliable and consistent, the paired prompt-response framing offers a useful advance over binary safety metrics by quantifying risk changes and identifying specific mechanisms and tradeoffs. The public release of grader prompts supports reproducibility. The central empirical counts could help prioritize safety interventions if the labeling foundation is strengthened.
major comments (1)
- [Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.
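For readers unfamiliar with the agreement statistic the referee asks for, here is a minimal sketch of unweighted Cohen's kappa for two annotators. The label sequences are hypothetical and the paper reports no such figures. Unweighted kappa ignores how far apart two ordinal labels are, so a weighted variant may suit a Safe/Low/Medium/High scale better.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical severity labels from two annotators on six items.
annotator_a = ["Safe", "Low", "Medium", "High", "Safe", "Low"]
annotator_b = ["Safe", "Low", "Medium", "Medium", "Safe", "Safe"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```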
minor comments (2)
- [Abstract] Abstract and results: Adding the total number of prompt-response pairs would help readers assess the precision of the reported percentages.
- [Results] Relevance analysis: The helpfulness-harmlessness tradeoff is noted but lacks detail on how relevance was defined or measured (e.g., annotation rubric or automated metric).
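To illustrate why the total number of pairs matters for the first minor comment, the sketch below computes a Wilson score interval for a 3% escalation rate at several sample sizes. The sample sizes are hypothetical, since the paper's total is not reported here, and the choice of the Wilson interval is an assumption about a reasonable method, not a description of the authors' analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical sample sizes; each has exactly 3% escalating pairs.
for n in (100, 1000, 10000):
    low, high = wilson_interval(3 * n // 100, n)
    print(f"n={n}: 3% escalation, interval {low:.4f} to {high:.4f}")
```

The interval narrows as the number of pairs grows, which is why a small percentage such as 3% is hard to interpret without the underlying count.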
Simulated Author's Rebuttal
We thank the referee for their constructive review and for acknowledging the potential value of the paired prompt-response framing over binary metrics. We address the major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.
  Authors: We agree that these details are critical for assessing label reliability and thus the robustness of the reported 61/36/3 distribution and category-specific mechanisms. The submitted manuscript does not explicitly report sample size, inter-annotator agreement, exclusion criteria, calibration procedure, or cross-category anchoring. We will revise the Methods section to include this information from our annotation records: the total number of prompt-response pairs, any computed agreement metrics (such as Cohen's kappa for overlapping annotations), exclusion rules applied, the calibration process used for the ordinal scale, and how labels were anchored across the four harm categories. These additions will support interpretation of the reduction, preservation, and escalation rates without changing the core empirical results.
  Revision: yes
Circularity Check
No circularity: direct empirical counts from human labels
The paper conducts a paired empirical analysis by comparing human-annotated ordinal severity levels (Safe, Low, Medium, High) between prompts and responses across four harm categories. The headline statistics (61% reduce harm, 36% preserve severity, 3% escalate) and category decompositions are computed directly from these label differences with no equations, fitted parameters, model derivations, or self-referential steps. No load-bearing claim reduces to a self-citation chain, ansatz, or input-by-construction; the results are independent observations from the labeled dataset and remain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-provided ordinal severity labels for prompts and responses are reliable and consistent across annotators.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate... Category decomposition shows that Sexual content exhibits the highest harm persistence"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "ordinal severity levels (Safe, Low, Medium, High) for both prompts and responses"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.