Title resolution pending

Goldstein, J · 2023 · arXiv 2301.04246

16 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver will retry the DOI and OpenAlex lookups on its next pass.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Unsupervised style representations learned via paraphrase inversion enable competitive few-shot and zero-shot AI-text detection with better generalization to unseen LLMs than supervised baselines.

Characterizing Opinion Evolution of Networked LLMs

cs.MA · 2026-06-05 · unverdicted · novelty 6.0

Modified classical opinion dynamics models with a bias term capture LLM network opinion evolution better than naive averaging, reducing mean opinion error by up to 88% and generalizing across models, topics, and networks.

Large Language Models Hack Rewards, and Society

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

LLMs discover regulatory loopholes in simulated societal environments through reward hacking during RL training.

The End of Trust: How Agentic AI Breaks Security Assumptions

cs.CR · 2026-05-14 · unverdicted · novelty 6.0

Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

Troll Farms

econ.TH · 2024-11-05 · unverdicted · novelty 6.0

A sender manipulates election outcomes via targeted uninformative messages that mimic exogenous voter signals, with influence rising in signal precision and falling in polarization; costly messaging leads to selective targeting.

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

cs.CL · 2023-10-03 · conditional · novelty 6.0

AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than baselines while bypassing perplexity defenses.

Jailbroken: How Does LLM Safety Training Fail?

cs.LG · 2023-07-05 · unverdicted · novelty 6.0

LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

cs.AI · 2023-03-31 · conditional · novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

cs.CV · 2026-06-02 · unverdicted · novelty 5.0

IDO uses channel-wise reweighting, Gaussian modeling of factual uncertainty, and incongruity contrastive learning to achieve SOTA multimodal fake news detection.

The Future of Facts: Tracing the Factual Generation-Verification Gap

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

Empirical tracing across model families shows verification precedes and outlasts generation for facts, with updates producing simultaneous verification of old and new answers.

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

cs.CL · 2025-08-28 · unverdicted · novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

cs.AI · 2024-08-23 · unverdicted · novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.

Human-Centred Risk Mitigation for AI-Mediated Information Manipulation: A SOCMINT Framework Based on Information Manipulation Sets

cs.CY · 2026-06-08 · unverdicted · novelty 3.0

Proposes an IMS-based SOCMINT framework and pipeline for human-centred mitigation of AI-mediated information manipulation, building on existing VIGINUM/EEAS usage.

ClausewitzGPT Framework: A New Frontier in Theoretical Large Language Model Enhanced Information Operations

cs.CY · 2023-10-11 · unverdicted · novelty 2.0

Introduces the ClausewitzGPT equation as a mathematical formulation to quantify risks in LLM-augmented information operations, drawing on Clausewitz principles and emphasizing ethical autonomous AI agents.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Troll Farms

econ.TH · 2024-11-05 · unverdicted · none · ref 28

A sender manipulates election outcomes via targeted uninformative messages that mimic exogenous voter signals, with influence rising in signal precision and falling in polarization; costly messaging leads to selective targeting.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

cs.AI · 2024-08-23 · unverdicted · none · ref 252

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
