Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment

Haoran Wang, Kai Shu · 2023 · arXiv 2311.09433

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

AgentReview: Exploring Peer Review Dynamics with LLM Agents

cs.CL · 2024-06-18 · unverdicted · novelty 8.0

AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

On the Privacy of LLMs: An Ablation Study

cs.CR · 2026-05-04 · unverdicted · novelty 4.0

Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 4 of 4 citing papers.

AgentReview: Exploring Peer Review Dynamics with LLM Agents cs.CL · 2024-06-18 · unverdicted · none · ref 51
AgentReview is the first LLM-based simulation framework for peer review that quantifies a 37.1% decision variation attributable to reviewer biases.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 194
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
On the Privacy of LLMs: An Ablation Study cs.CR · 2026-05-04 · unverdicted · none · ref 30
Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 155
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer