pith. sign in

hub

Autodan: inter- pretable gradient-based adversarial attacks on large lan- guage models.arXiv preprint arXiv:2310.15140

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

hub tools

citation-role summary

background 3 method 1

citation-polarity summary

polarities

background 3 unclear 1

representative citing papers

Exploring the Secondary Risks of Large Language Models

cs.LG · 2025-06-14 · unverdicted · novelty 6.0

Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

Whispers in the Machine: Confidentiality in Agentic Systems

cs.CR · 2024-02-10 · unverdicted · novelty 6.0

Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

citing papers explorer

Showing 16 of 16 citing papers.