Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

years

2026: 1 · 2024: 3

roles

background 2

polarities

background 2

representative citing papers

On the Privacy of LLMs: An Ablation Study

cs.CR · 2026-05-04 · unverdicted · novelty 4.0

Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
