{"total":18,"items":[{"citing_arxiv_id":"2606.05976","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Self-Correction Illusion: LLMs Correct Others but Not Themselves","primary_cat":"cs.AI","submitted_at":"2026-06-04T10:17:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30454","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Surface You Test Is Not the Surface That Breaks","primary_cat":"cs.CR","submitted_at":"2026-05-28T18:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-surface baselines by 9.1 pp on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19192","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hallucination as Exploit: Evidence-Carrying Multimodal Agents","primary_cat":"cs.AI","submitted_at":"2026-05-18T23:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only 
when certificates support them, achieving zero unsafe executions in tested tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07490","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cross-Modal Backdoors in Multimodal Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-08T09:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"instead assume that model parameters remain fixed and exploit the fragility of multimodal inputs or representations during deployment. Prior work has shown that multimodal systems can be manipulated at inference time without changing model parameters. Bagdasaryan et al. show that images and audio can deliver indirect instruction injection against MLLMs [8]. Zhao et al. systematically evaluate the adversarial robustness of large vision-language models and show that carefully crafted visual perturbations can substantially alter downstream behavior [22]. More closely related to our setting, adversarial-alignment attacks show that the representation space itself can be steered. 
Carlini et al. study adversarial alignment in"},{"citing_arxiv_id":"2605.04261","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Laundering AI Authority with Adversarial Examples","primary_cat":"cs.CR","submitted_at":"2026-05-05T19:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In all cases, our attacks weaponize the trustworthiness and truthfulness that AI assistants aim to promote to their users. Critically, authority laundering is not a misalignment attack. The model responds helpfully, harmlessly, and honestly to what it (incorrectly) perceives. This distinguishes our threat model from jailbreaks [45, 69] and prompt injections [6, 59], which subvert the model's policy or instructions. It also makes alignment-based defenses (safety fine-tuning, refusal training) irrelevant against our attacks. The relevant-and largely unsolved-problem is adversarial robustness of visual representations. Mounting these attacks is alarmingly easy. 
Using only vanilla PGD [34] against an ensemble of publicly available CLIP models (a"},{"citing_arxiv_id":"2605.01449","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-02T13:56:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"attacker seeking reliable payload delivery would find the success rate uncompetitive with cheaper non-adversarial channels (typographic injection, social engineering). The artifacts are most useful for defenders building VLM-input filters and for researchers replicating or extending the methodology. References [1] Anthropic. Claude Opus 4.7 (1m context). https://www.anthropic.com/claude/opus, 2026. URL https://www.anthropic.com/claude/opus. [2] Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. Abusing images and sounds for indirect instruction injection in multi-modal LLMs, 2023. URL https://arxiv.org/abs/2307.10490. [3] Shuai Bai et al. Qwen2.5-VL technical report, 2025. URL https://arxiv.org/abs/2502.13923. 
[4] Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons."},{"citing_arxiv_id":"2604.27267","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems","primary_cat":"cs.CR","submitted_at":"2026-04-29T23:44:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23374","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-04-25T16:39:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24790","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic Denial of Service in LLM-controlled robots","primary_cat":"cs.CR","submitted_at":"2026-04-25T10:52:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Injecting brief safety-plausible phrases into robot audio triggers LLM safety halts, enabling semantic denial-of-service attacks where prompt defenses trade attack suppression for impaired genuine hazard 
detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21477","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-23T09:39:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MCP Pitfall Lab operationalizes six pitfall classes across tool-metadata poisoning, puppet servers, and multimodal chains, showing that recommended hardening removes all Tier-1 static findings and that agent narratives mismatch traces in 63% of tested runs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18860","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents","primary_cat":"cs.CR","submitted_at":"2026-04-20T21:36:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failing on DOM injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14604","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt 
Injection","primary_cat":"cs.CR","submitted_at":"2026-04-16T04:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"highly context-sensitive, the attack must generalize across unknown user contexts to reliably trigger the target behavior. Third, it is difficult to achieve precise behavior control while maintaining perceptual stealth. Existing input-level mixing is easily detectable [19], whereas feature-level injection is ineffective due to the modality gap between audio and text [20]. A novel injection strategy is therefore required to reconcile attack imperceptibility with effectiveness. In this paper, we aim to realize a context-agnostic and imperceptible auditory prompt injection attack against LALMs to address these challenges. 
We first propose an output-level injection strategy based on audio adversarial examples [21]-[25]."},{"citing_arxiv_id":"2512.12069","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring","primary_cat":"cs.CR","submitted_at":"2025-12-12T22:31:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23883","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","primary_cat":"cs.AI","submitted_at":"2025-10-27T21:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Beyond traditional text-based prompt injection, AI agents are increasingly vulnerable to attacks that exploit the growing capabilities of modern models, including code generation/execution, and multimodal understanding [79, 80, 61]. Adversaries can leverage agents' black box nature to human eyes, training methods, and interpretations of their inner logic, to conceal and inject instructions within images, sounds, or videos [81]. 
Naturally, as conventional filters and defenses are more general in nature, they fail to detect such agent-specific exploitations. We visualize these attacks in Figure 4 and discuss them in detail next. Text-Based Injection With the rise in popularity of LLM agents, text-based attacks can manifest in different forms. These range from direct prompt injection to code injection under the pretense of legitimate programming activities,"},{"citing_arxiv_id":"2503.06223","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion","primary_cat":"cs.CV","submitted_at":"2025-03-08T13:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05206","ref_index":293,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety","primary_cat":"cs.CR","submitted_at":"2025-02-02T05:14:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1) (i) White-Box: [264] [265] [266] [267] [268] (ii) Gray-Box: [269] (iii) Black-Box: [270] [271] [272] [273] Jailbreak Attacks (§ 5.2) (i) White-Box: [274] [275] [276] [277] [278] [279] 
(ii) Black-Box: [280] [281] [282] [283] [284] Jailbreak Defenses (§ 5.3) (i) Jailbreak Detection: [285] [286] (ii) Jailbreak Prevention: [287] [288] [289] [290] [291] Energy Latency Attacks (§ 5.4) (i) White-Box: [292] Prompt Injection Attack (§ 5.5) (i) White-Box: [293] (ii) Black-Box: [294] Backdoor & Poisoning Attacks (§ 5.6) (i) Backdoor: [295] [296] [297] [298] (i) Poisoning: [299] Diffusion Models (§ 6) Adversarial Attacks (§ 6.1) (i) White-Box: [300] [301] [302] (ii) Gray-Box: [303] [304] [305] [306] (iii) Black-Box: [307] [308] [309] [310] [311] [312] [313] Jailbreak Attacks (§ 6.2) (i) White-Box: [314] [315] [316] [317] (ii) Gray-Box: [318] [319]"},{"citing_arxiv_id":"2408.12935","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions","primary_cat":"cs.AI","submitted_at":"2024-08-23T09:33:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Unlike previous work that targets only a single modality, Yang et al. [791] have studied poisoning attacks against image and text encoders simultaneously, and observed significant attack performance. To covertly inject hidden malicious behaviors, backdoor injection methods on MLLMs are also explored. These methods steer the model to follow instructions embedded in the poisoned instruction tuning samples [34, 417, 418]. BadVLMDriver [529] highlights that MLLMs could be manipulated not only by typical backdoor attacks relying on digital modifications but also by physical objects. 
For instance, in the context of autonomous driving, a car could unexpectedly accelerate upon detecting a real trigger object due to the backdoor injection. To counter these backdoor"},{"citing_arxiv_id":"2402.06922","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Whispers in the Machine: Confidentiality in Agentic Systems","primary_cat":"cs.CR","submitted_at":"2024-02-10T11:07:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}