From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Mengya Hu; Qiong Wei; Sandeep Atluri

arxiv: 2604.26052 · v3 · pith:LYYKRBJUnew · submitted 2026-04-28 · 💻 cs.CL

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM safety evaluation · paired prompt response analysis · harm severity levels · escalation mechanisms · helpfulness harmlessness tradeoff · ordinal harm labeling

The pith

Paired analysis shows 61% of LLM responses reduce harm from the input prompt while 3% escalate it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a paired comparison of human-labeled prompts and responses across four harm categories using ordinal severity levels. It establishes concrete rates at which responses lower, maintain, or raise harm relative to the prompt, along with the specific mechanisms driving the small escalation share and category differences. A sympathetic reader would care because binary safety metrics miss these dynamics, and the paired view reveals where compliance and relevance trade off against harm reduction. The analysis also shows that few-shot LLM-based grading exhibits an asymmetry between prompt and response detection.

Core claim

The central claim is that safety evaluations must move from isolated prompt or response classification to paired prompt-response records. When both sides carry ordinal severity labels, 61% of responses reduce harm, 36% preserve severity, and 3% escalate. Escalation occurs either through unrequested harmful detail on benign prompts or through on-task answers at higher severity than the prompt. Sexual content shows the highest persistence, driven by same-severity compliance.

What carries the argument

Paired ordinal severity labeling (Safe, Low, Medium, High) of both prompt and response, which tracks harm change rather than binary outcomes.
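As a minimal sketch of how paired ordinal labels can be turned into a reduce / preserve / escalate split, the snippet below encodes the four severity levels as integers and compares the response level against the prompt level. The integer encoding, the function names `transition` and `transition_rates`, and the toy `pairs` list are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter

# Ordinal scale named in the paper; the integer encoding is an assumption.
SEVERITY = {"Safe": 0, "Low": 1, "Medium": 2, "High": 3}

def transition(prompt_label: str, response_label: str) -> str:
    """Classify one prompt-response pair by the direction of severity change."""
    delta = SEVERITY[response_label] - SEVERITY[prompt_label]
    if delta < 0:
        return "reduce"
    if delta > 0:
        return "escalate"
    return "preserve"

def transition_rates(pairs):
    """Return the share of pairs in each transition class."""
    if not pairs:
        raise ValueError("need at least one prompt-response pair")
    counts = Counter(transition(p, r) for p, r in pairs)
    return {k: counts.get(k, 0) / len(pairs) for k in ("reduce", "preserve", "escalate")}

# Toy pairs, invented purely for illustration.
pairs = [("High", "Safe"), ("Medium", "Low"), ("Low", "Low"), ("Safe", "Medium")]
rates = transition_rates(pairs)
```

Under this rule a Safe prompt answered with a Safe response counts as "preserve"; whether the paper treats that case the same way is not stated in the text reproduced here.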

If this is right

Most LLM outputs decrease the harm severity of the original prompt.
Escalations arise from either adding unrequested harmful content or increasing severity while remaining on task.
Sexual content shows the highest harm persistence, driven by compliance at the same severity level.
Safe refusals tend to have low relevance, exposing a helpfulness-harmlessness tradeoff.
Few-shot LLM graders detect risk more readily in prompts than in responses even after calibration.
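One way to quantify the prompt/response detection asymmetry in the last item is to compare grader recall against human labels on each side. The sketch below is an assumption about how such a gap could be measured: it collapses severity to a binary harmful flag, and the names `recall` and `detection_gap` and the toy flags are hypothetical rather than taken from the paper or its repository.

```python
def recall(human_flags, grader_flags):
    """Share of human-flagged harmful items that the grader also flags."""
    hits = [g for h, g in zip(human_flags, grader_flags) if h]
    if not hits:
        raise ValueError("no human-flagged items to evaluate")
    return sum(hits) / len(hits)

def detection_gap(prompt_human, prompt_grader, response_human, response_grader):
    """Positive values mean the grader catches harmful prompts more often than harmful responses."""
    return recall(prompt_human, prompt_grader) - recall(response_human, response_grader)

# Toy flags (True = harmful), invented for illustration only.
prompt_human = [True, True, True, False]
prompt_grader = [True, True, True, False]
response_human = [True, True, False, False]
response_grader = [True, False, False, False]

gap = detection_gap(prompt_human, prompt_grader, response_human, response_grader)
```

Recall alone ignores false positives, so a fuller comparison would also report precision on each side.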

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The paired method could be reused on newer models or additional categories to measure whether safety training reduces escalation rates over time.
Alignment efforts might target the two distinct escalation mechanisms separately rather than treating all harm uniformly.
Better refusal templates could aim to preserve relevance while still lowering severity, addressing the observed tradeoff.

Load-bearing premise

Human annotations assign accurate and consistent ordinal severity levels to prompts and responses across all categories.

What would settle it

Independent re-labeling of the same prompt-response pairs that produces materially different percentages for reduction, preservation, and escalation.

Figures

Figures reproduced from arXiv: 2604.26052 by Mengya Hu, Qiong Wei, Sandeep Atluri.

**Figure 1.** Aggregate prompt→response max-severity transition matrix. Off-diagonal mass above the diagonal = de-escalation; below = escalation. Escalation audit: manual inspection of the 40 escalation cases surfaces two recurring mechanisms: (1) unsolicited elaboration: a benign or low-harm prompt triggers a response that adds harmful detail not requested; (2) compliance escalation: an already-harmful prompt is answ…

**Figure 2.** Per-category 4×4 prompt→response severity transition matrices. Cells show raw counts; diagonal = severity preserved; above diagonal = de-escalation; below = escalation. Sexual content has the most mass in off-diagonal persistence cells at severity ≥ 1; Violence and Hate show more drift-up (bottom row, non-zero response severity). … responses exceed 500 characters, 50.1% exceed 1,000, and 21.5% exceed 2,000, …
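The transition matrices in these figures can be approximated with a short counting routine. The sketch below collapses per-category labels to a single maximum severity, as the Figure 1 caption suggests, and tallies a 4×4 matrix. The row/column orientation (rows = prompt, columns = response), the record layout, and the toy `records` are assumptions, so "above" and "below" the diagonal here may not map one-to-one onto the figures.

```python
LEVELS = ["Safe", "Low", "Medium", "High"]
INDEX = {name: i for i, name in enumerate(LEVELS)}

def max_severity(labels_by_category: dict) -> int:
    """Collapse per-category labels to the single highest severity index."""
    return max(INDEX[label] for label in labels_by_category.values())

def transition_matrix(records):
    """Count pairs in a 4x4 grid: rows = prompt severity, columns = response severity."""
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for prompt_labels, response_labels in records:
        matrix[max_severity(prompt_labels)][max_severity(response_labels)] += 1
    return matrix

# Toy records, invented for illustration only.
records = [
    ({"Sexual": "Safe", "Violence": "High"}, {"Sexual": "Safe", "Violence": "Low"}),
    ({"Hate": "Medium"}, {"Hate": "Medium"}),
    ({"Self-harm": "Safe"}, {"Self-harm": "Low"}),
]
matrix = transition_matrix(records)
```

A per-category version would skip `max_severity` and build one matrix per category key instead.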

Original abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Paired ordinal tracking shows most LLM responses lower harm severity with only 3% escalation, but the numbers rest on human labels without reported consistency checks or sample details.

The main thing to know is that this paper tracks how harm changes from prompt to response in LLMs using ordinal severity labels. They report that 61 percent of responses reduce harm, 36 percent keep the same level, and 3 percent escalate it. Sexual content shows the highest persistence, mostly from staying at the same severity rather than jumping up from safe prompts.

What stands out as new is the decomposition of those escalations into two mechanisms. One is benign prompts that trigger unrequested harmful details. The other is responses that stay on task but at a higher severity than the prompt. They also bring in a relevance check that highlights a tradeoff: the escalating compliant responses tend to be highly relevant, while the safe ones are often generic and low-relevance.

The work does a decent job extending binary safety metrics to this paired ordinal setup across four categories. Sharing the few-shot grader prompts on GitHub is a plus for anyone wanting to look at the detection asymmetry they found.

The soft spots center on the human labeling. The abstract states the percentages and the category findings without giving sample size, inter-annotator agreement, or any calibration details for the severity scale. That leaves open whether the levels are comparable across categories like sexual versus violence. If not, the 61/36/3 split and the persistence claim lose some ground. The stress-test note points to exactly this issue, and it looks like a real gap from the provided text.

This paper would suit readers working on LLM safety evaluations who want more than refusal rates. It gives a framework for severity tracking and mechanism analysis that could inform better benchmarks. It deserves a serious referee. The core observations are worth checking out even if the current presentation needs more on the methods. I would recommend sending it to peer review with a request to add the missing details on annotation reliability and dataset basics.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a paired analysis of human-labeled LLM prompts and responses across four harm categories (Sexual, Self-harm, Hate, Violence) using ordinal severity levels (Safe, Low, Medium, High). It reports that 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate, with escalation split into two mechanisms (benign prompts triggering unrequested harmful detail; on-task answers at higher severity). Sexual content shows highest harm persistence via compliance at the same severity; the work also examines relevance tradeoffs between helpfulness and harmlessness plus asymmetries in few-shot LLM graders, with grader prompts shared on GitHub.

Significance. If the human annotations are reliable and consistent, the paired prompt-response framing offers a useful advance over binary safety metrics by quantifying risk changes and identifying specific mechanisms and tradeoffs. The public release of grader prompts supports reproducibility. The central empirical counts could help prioritize safety interventions if the labeling foundation is strengthened.

major comments (1)

[Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.
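For readers unfamiliar with the agreement statistic named in this comment, here is a minimal sketch of unweighted Cohen's kappa for two annotators labeling the same items. It is a generic textbook computation, not the authors' procedure; the function name `cohens_kappa` and the toy label lists are assumptions. For ordinal scales such as Safe/Low/Medium/High, a weighted variant that penalizes distant disagreements more heavily is often preferred.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations, invented for illustration only.
annotator_a = ["Safe", "Low", "Medium", "High", "Safe", "Low"]
annotator_b = ["Safe", "Low", "High", "High", "Safe", "Medium"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

With three or more annotators per item, Fleiss' kappa (also named in the comment) is the usual generalization.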

minor comments (2)

[Abstract] Abstract and results: Adding the total number of prompt-response pairs would help readers assess the precision of the reported percentages.
[Results] Relevance analysis: The helpfulness-harmlessness tradeoff is noted but lacks detail on how relevance was defined or measured (e.g., annotation rubric or automated metric).
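To illustrate why the pair count matters for the first minor comment, the sketch below computes a Wilson score interval for a proportion. The Wilson interval is one common choice for small proportions such as a 3% escalation rate; nothing here confirms it is the method the authors used. The function name `wilson_interval` and the sample sizes are placeholders, since the total number of pairs is not given in the text reproduced here.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Same 3% rate at two hypothetical sample sizes.
small = wilson_interval(3, 100)
large = wilson_interval(30, 1000)
```

The interval narrows as the sample grows, which is why a bare percentage without the number of pairs is hard to interpret.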

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential value of the paired prompt-response framing over binary metrics. We address the major comment below and will revise the manuscript to incorporate the requested details.

Referee: [Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.

Authors: We agree that these details are critical for assessing label reliability and thus the robustness of the reported 61/36/3 distribution and category-specific mechanisms. The submitted manuscript does not explicitly report sample size, inter-annotator agreement, exclusion criteria, calibration procedure, or cross-category anchoring. We will revise the Methods section to include this information from our annotation records: the total number of prompt-response pairs, any computed agreement metrics (such as Cohen's kappa for overlapping annotations), exclusion rules applied, the calibration process used for the ordinal scale, and how labels were anchored across the four harm categories. These additions will support interpretation of the reduction, preservation, and escalation rates without changing the core empirical results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical counts from human labels

The paper conducts a paired empirical analysis by comparing human-annotated ordinal severity levels (Safe, Low, Medium, High) between prompts and responses across four harm categories. The headline statistics (61% reduce harm, 36% preserve severity, 3% escalate) and category decompositions are computed directly from these label differences with no equations, fitted parameters, model derivations, or self-referential steps. No load-bearing claim reduces to a self-citation chain, ansatz, or input-by-construction; the results are independent observations from the labeled dataset and remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the reliability of human severity labels as the sole basis for all quantitative claims; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Human-provided ordinal severity labels for prompts and responses are reliable and consistent across annotators.
All reported percentages (61% reduce, 36% preserve, 3% escalate) and category decompositions rest directly on this labeling quality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate... Category decomposition shows that Sexual content exhibits the highest harm persistence"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ordinal severity levels (Safe, Low, Medium, High) for both prompts and responses"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

