arxiv: 2604.26052 · v3 · pith:LYYKRBJUnew · submitted 2026-04-28 · 💻 cs.CL

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Pith reviewed 2026-05-21 08:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM safety evaluation · paired prompt response analysis · harm severity levels · escalation mechanisms · helpfulness harmlessness tradeoff · ordinal harm labeling

The pith

Paired analysis shows 61% of LLM responses reduce harm from the input prompt while 3% escalate it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a paired comparison of human-labeled prompts and responses across four harm categories using ordinal severity levels. It establishes concrete rates at which responses lower, maintain, or raise harm relative to the prompt, along with the specific mechanisms driving the small escalation share and category differences. A sympathetic reader would care because binary safety metrics miss these dynamics, and the paired view reveals where compliance and relevance trade off against harm reduction. The analysis also shows that few-shot LLM-based grading exhibits an asymmetry between prompt and response detection.

Core claim

The central claim is that safety evaluation should move from classifying prompts or responses in isolation to scoring paired prompt-response records. Under ordinal severity labels, 61% of responses reduce harm, 36% preserve severity, and 3% escalate. Escalation occurs either through unrequested harmful detail on benign prompts or through on-task answers at higher severity than the prompt. Sexual content shows the highest persistence, driven by same-severity compliance.

What carries the argument

Paired ordinal severity labeling (Safe, Low, Medium, High) of both prompt and response, which tracks harm change rather than binary outcomes.
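The paired comparison can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Severity` enum, the `direction` and `direction_rates` helpers, and the toy pairs are all hypothetical, and treating a Safe prompt with a Safe response as "preserve" is an assumption the summary above does not confirm.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal harm levels named in the paper; the integer coding is an assumption."""
    SAFE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def direction(prompt: Severity, response: Severity) -> str:
    """Classify one prompt-response pair by how severity changes."""
    if response < prompt:
        return "reduce"
    if response > prompt:
        return "escalate"
    return "preserve"  # includes Safe -> Safe under this sketch's assumption


def direction_rates(pairs):
    """Return the share of pairs that reduce, preserve, or escalate severity."""
    counts = Counter(direction(p, r) for p, r in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one prompt-response pair")
    return {k: counts[k] / total for k in ("reduce", "preserve", "escalate")}


# Toy data for illustration only; these are not records from the paper.
toy_pairs = [
    (Severity.HIGH, Severity.SAFE),
    (Severity.MEDIUM, Severity.LOW),
    (Severity.LOW, Severity.LOW),
    (Severity.SAFE, Severity.LOW),
]
rates = direction_rates(toy_pairs)
print(rates)
```

Because the labels are ordinal, a plain integer comparison is enough to recover the direction of change; a binary harmful/safe label would collapse the Medium to Low case into "no change".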

If this is right

  • Most LLM outputs decrease the harm severity of the original prompt.
  • Escalations arise from either adding unrequested harmful content or increasing severity while remaining on task.
  • Sexual content shows the highest harm persistence, driven by compliance at the same severity level.
  • Safe refusals tend to have low relevance, exposing a helpfulness-harmlessness tradeoff.
  • Few-shot LLM graders detect risk more readily in prompts than in responses even after calibration.
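One way to quantify the grader asymmetry in the last bullet is to compare detection rates on each side of the pair. The sketch below is an assumption-laden illustration: the summary does not say which metric the paper uses, so recall on human-labeled harmful items stands in for it, and `detection_rate`, `detection_gap`, and the toy labels are hypothetical.

```python
def detection_rate(human_severity, grader_flags):
    """Share of human-labeled harmful items (severity > 0) that the grader flagged.

    Using recall as the detection measure is an assumption of this sketch.
    """
    flagged = [flag for sev, flag in zip(human_severity, grader_flags) if sev > 0]
    if not flagged:
        raise ValueError("no harmful items to score")
    return sum(flagged) / len(flagged)


def detection_gap(prompt_human, prompt_flags, response_human, response_flags):
    """Prompt-side detection rate minus response-side detection rate."""
    return (detection_rate(prompt_human, prompt_flags)
            - detection_rate(response_human, response_flags))


# Toy labels for illustration only (0 = Safe ... 3 = High); not data from the paper.
prompt_human = [0, 1, 2, 3]
prompt_flags = [False, True, True, True]
response_human = [0, 1, 2, 3]
response_flags = [False, False, True, True]

gap = detection_gap(prompt_human, prompt_flags, response_human, response_flags)
print(f"prompt-minus-response detection gap: {gap:.3f}")
```

A positive gap means the grader catches harmful prompts more often than harmful responses, which is the direction the summary reports.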

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The paired method could be reused on newer models or additional categories to measure whether safety training reduces escalation rates over time.
  • Alignment efforts might target the two distinct escalation mechanisms separately rather than treating all harm uniformly.
  • Better refusal templates could aim to preserve relevance while still lowering severity, addressing the observed tradeoff.

Load-bearing premise

Human annotations assign accurate and consistent ordinal severity levels to prompts and responses across all categories.

What would settle it

Independent re-labeling of the same prompt-response pairs that produces materially different percentages for reduction, preservation, and escalation.

Figures

Figures reproduced from arXiv: 2604.26052 by Mengya Hu, Qiong Wei, Sandeep Atluri.

Figure 1: Aggregate prompt→response max-severity transition matrix. Off-diagonal mass above the diagonal = de-escalation; below = escalation.
Escalation audit. Manual inspection of the 40 escalation cases surfaces two recurring mechanisms: (1) unsolicited elaboration: a benign or low-harm prompt triggers a response that adds harmful detail not requested; (2) compliance escalation: an already-harmful prompt is answ…
Figure 2: Per-category 4×4 prompt→response severity transition matrices. Cells show raw counts; diagonal = severity preserved; above diagonal = de-escalation; below = escalation. Sexual content has the most mass in off-diagonal persistence cells at severity ≥ 1; Violence and Hate show more drift-up (bottom row, non-zero response severity).
… responses exceed 500 characters, 50.1% exceed 1,000, and 21.5% exceed 2,000, …
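The transition matrices described in the two captions can be tabulated as follows. This is a sketch under stated assumptions: each record is assumed to carry one severity per category for both prompt and response, the aggregate matrix uses the maximum severity across categories (following the "max-severity" wording in Figure 1), and `transition_matrices` plus the toy records are hypothetical. Rows index prompt severity and columns index response severity, both ordered Safe to High, which may not match the axis layout of the original figures.

```python
CATEGORIES = ("Sexual", "Self-harm", "Hate", "Violence")
N_LEVELS = 4  # Safe, Low, Medium, High coded as 0..3 (assumed coding)


def empty_matrix():
    return [[0] * N_LEVELS for _ in range(N_LEVELS)]


def transition_matrices(records):
    """Count prompt -> response severity transitions.

    records: iterable of (prompt_sev, response_sev), each a dict mapping
    category name to an integer severity. Returns (per_category, aggregate),
    where matrix[prompt_level][response_level] is a raw count.
    """
    per_category = {c: empty_matrix() for c in CATEGORIES}
    aggregate = empty_matrix()
    for prompt_sev, response_sev in records:
        for c in CATEGORIES:
            per_category[c][prompt_sev[c]][response_sev[c]] += 1
        # Aggregate view: compare the worst category on each side.
        aggregate[max(prompt_sev.values())][max(response_sev.values())] += 1
    return per_category, aggregate


# Toy records for illustration only; not data from the paper.
toy_records = [
    ({"Sexual": 0, "Self-harm": 0, "Hate": 2, "Violence": 1},
     {"Sexual": 0, "Self-harm": 0, "Hate": 0, "Violence": 1}),
    ({"Sexual": 0, "Self-harm": 0, "Hate": 0, "Violence": 0},
     {"Sexual": 0, "Self-harm": 0, "Hate": 0, "Violence": 1}),
]
per_category, aggregate = transition_matrices(toy_records)
for row in aggregate:
    print(row)
```

In this layout, diagonal cells hold preserved severity, cells with a column index below the row index hold de-escalation, and cells with a column index above the row index hold escalation.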
Original abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a paired analysis of human-labeled LLM prompts and responses across four harm categories (Sexual, Self-harm, Hate, Violence) using ordinal severity levels (Safe, Low, Medium, High). It reports that 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate, with escalation split into two mechanisms (benign prompts triggering unrequested harmful detail; on-task answers at higher severity). Sexual content shows highest harm persistence via compliance at the same severity; the work also examines relevance tradeoffs between helpfulness and harmlessness plus asymmetries in few-shot LLM graders, with grader prompts shared on GitHub.

Significance. If the human annotations are reliable and consistent, the paired prompt-response framing offers a useful advance over binary safety metrics by quantifying risk changes and identifying specific mechanisms and tradeoffs. The public release of grader prompts supports reproducibility. The central empirical counts could help prioritize safety interventions if the labeling foundation is strengthened.

major comments (1)
  1. [Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.
minor comments (2)
  1. [Abstract] Abstract and results: Adding the total number of prompt-response pairs would help readers assess the precision of the reported percentages.
  2. [Results] Relevance analysis: The helpfulness-harmlessness tradeoff is noted but lacks detail on how relevance was defined or measured (e.g., annotation rubric or automated metric).
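The first minor comment turns on how much a percentage can be trusted without its denominator. A Wilson score interval makes the dependence on sample size concrete. The sketch below is illustrative only: `wilson_interval` is a hypothetical helper, the counts passed to it are made up, and nothing here implies the paper reports such intervals or that its sample has these sizes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical counts: the same 3% rate observed at two made-up sample sizes.
small = wilson_interval(30, 1_000)
large = wilson_interval(300, 10_000)
print("n=1,000: ", small)
print("n=10,000:", large)
```

The same headline rate yields a much wider interval at the smaller sample size, which is why the total number of prompt-response pairs matters most for the small escalation share.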

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential value of the paired prompt-response framing over binary metrics. We address the major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Methods / Annotation] Methods section (annotation procedure): No sample size, inter-annotator agreement (e.g., Cohen's or Fleiss' kappa), exclusion criteria, calibration procedure, or cross-category anchoring for the ordinal severity labels is reported. The headline 61/36/3 split and category decompositions (including Sexual persistence driven by same-severity compliance) are direct arithmetic from these labels; without consistency checks the reduction/preservation/escalation rates and the two escalation mechanisms cannot be interpreted reliably.

    Authors: We agree that these details are critical for assessing label reliability and thus the robustness of the reported 61/36/3 distribution and category-specific mechanisms. The submitted manuscript does not explicitly report sample size, inter-annotator agreement, exclusion criteria, calibration procedure, or cross-category anchoring. We will revise the Methods section to include this information from our annotation records: the total number of prompt-response pairs, any computed agreement metrics (such as Cohen's kappa for overlapping annotations), exclusion rules applied, the calibration process used for the ordinal scale, and how labels were anchored across the four harm categories. These additions will support interpretation of the reduction, preservation, and escalation rates without changing the core empirical results. revision: yes
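For readers unfamiliar with the agreement metric named in the rebuttal, the following sketch computes Cohen's kappa for two annotators on the four-level scale. Linear weighting is an assumption chosen here because the labels are ordinal; the rebuttal does not say which variant would be reported. The function name `weighted_kappa` and the toy ratings are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, n_levels: int = 4) -> float:
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale 0..n_levels-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Joint distribution of the two raters' labels.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    obs_dis = 0.0  # observed weighted disagreement
    exp_dis = 0.0  # disagreement expected if raters were independent
    for i in range(n_levels):
        for j in range(n_levels):
            weight = abs(i - j) / (n_levels - 1)
            obs_dis += weight * observed[i][j]
            exp_dis += weight * marg_a[i] * marg_b[j]
    if exp_dis == 0:
        raise ValueError("kappa undefined: both raters used a single identical level")
    return 1 - obs_dis / exp_dis


# Toy ratings for illustration only (0 = Safe ... 3 = High); not annotation data from the paper.
annotator_1 = [0, 1, 2, 3, 1, 0]
annotator_2 = [0, 1, 2, 3, 2, 0]
print(f"linearly weighted kappa: {weighted_kappa(annotator_1, annotator_2):.3f}")
```

Weighting penalizes a Safe versus High disagreement more than a Low versus Medium one, which an unweighted kappa would treat as equally wrong.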

Circularity Check

0 steps flagged

No circularity: direct empirical counts from human labels

full rationale

The paper conducts a paired empirical analysis by comparing human-annotated ordinal severity levels (Safe, Low, Medium, High) between prompts and responses across four harm categories. The headline statistics (61% reduce harm, 36% preserve severity, 3% escalate) and category decompositions are computed directly from these label differences with no equations, fitted parameters, model derivations, or self-referential steps. No load-bearing claim reduces to a self-citation chain, ansatz, or input-by-construction; the results are independent observations from the labeled dataset and remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the reliability of human severity labels as the sole basis for all quantitative claims; no free parameters or new entities are introduced.

axioms (1)
  • [domain assumption] Human-provided ordinal severity labels for prompts and responses are reliable and consistent across annotators.
    All reported percentages (61% reduce, 36% preserve, 3% escalate) and category decompositions rest directly on this labeling quality.

pith-pipeline@v0.9.0 · 5735 in / 1261 out tokens · 60671 ms · 2026-05-21T08:11:35.725557+00:00 · methodology


