Guardians and offenders: A survey on harmful content generation and safety mitigation of llm

Zhang, C · 2025 · arXiv 2508.05775

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

Why Do Large Language Models Generate Harmful Content?

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

BarrierSteer: LLM Safety via Learning Barrier Steering

cs.LG · 2026-02-23 · unverdicted · novelty 6.0

BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.

citing papers explorer

Showing 4 of 4 citing papers.

Do Coding Agents Understand Least-Privilege Authorization? cs.CR · 2026-05-14 · unverdicted · none · ref 53
Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue cs.CL · 2026-05-07 · unverdicted · none · ref 39 · 2 links
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
Why Do Large Language Models Generate Harmful Content? cs.AI · 2026-04-13 · unverdicted · none · ref 24
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
BarrierSteer: LLM Safety via Learning Barrier Steering cs.LG · 2026-02-23 · unverdicted · none · ref 24
BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.

Guardians and offenders: A survey on harmful content generation and safety mitigation of llm

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer