A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Generative language models and automated influence operations: Emerging threats and potential mitigations.arXiv preprint arXiv:2301.04246, 1, 2023
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
A sender manipulates election outcomes via targeted uninformative messages that mimic exogenous voter signals, with influence rising in signal precision and falling in polarization; costly messaging leads to selective targeting.
AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity defenses.
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Empirical tracing across model families shows verification precedes and outlasts generation for facts, with updates producing simultaneous verification of old and new answers.
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
Introduces the ClausewitzGPT equation as a mathematical formulation to quantify risks in LLM-augmented information operations, drawing on Clausewitz principles and emphasizing ethical autonomous AI agents.
citing papers explorer
-
Jailbroken: How Does LLM Safety Training Fail?
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.