{"total":51,"items":[{"citing_arxiv_id":"2605.29960","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction","primary_cat":"cs.CR","submitted_at":"2026-05-28T14:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28999","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening","primary_cat":"cs.CR","submitted_at":"2026-05-27T18:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24817","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry","primary_cat":"cs.CR","submitted_at":"2026-05-24T02:06:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RouteScan identifies malicious prompts in MoE LLMs using GPU expert routing telemetry as a privacy-preserving fingerprint, achieving AUROC above 0.93 on unseen harmful 
domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24552","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling","primary_cat":"cs.CR","submitted_at":"2026-05-23T12:39:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24312","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Five Queries Are Enough: Query-Efficient and Surrogate-Free Membership Inference Attacks on RAG via Entailment","primary_cat":"cs.CR","submitted_at":"2026-05-23T00:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MEntA performs membership inference on RAG by measuring entailment between responses to five non-templated queries and candidate documents, reaching up to 0.991 AUC without surrogate models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21362","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T16:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LASH adaptively composes multiple jailbreak seed prompts via genetic search over 
subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20641","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-20T02:55:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19966","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:15:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19147","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning 
Attacks","primary_cat":"cs.CR","submitted_at":"2026-05-18T21:56:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15598","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-15T04:14:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11996","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts","primary_cat":"cs.AI","submitted_at":"2026-05-12T11:46:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3) Victim Systems: We evaluate BadSKP on two representative soft KG prompt systems, G-Retriever [6] and GNP [9]. 
To assess generalization across architectures, we consider four LLM backbones (LLaMA2-7B [41], LLaMA3-8B [42], Mistral-8B [43], and Qwen3-8B [44]) and four GNN encoders (GAT [45], GCN [46], Graph Transformer [47], and CGCNN [48]). We also evaluate the attack under a perplexity-based filtering defense [49], [50]. 4) Evaluation Metrics: We report two primary metrics. Accuracy (ACC) measures system utility and is computed as Hits@1 on clean, non-trigger queries. Attack Success Rate (ASR) measures attack effectiveness and is defined as the fraction of trigger-entity queries that elicit the attacker-specified response. Following the substring-matching con-"},{"citing_arxiv_id":"2605.10611","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Re-Triggering Safeguards within LLMs for Jailbreak Detection","primary_cat":"cs.CR","submitted_at":"2026-05-11T14:09:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08427","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play","primary_cat":"cs.AI","submitted_at":"2026-05-08T19:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[23] propose a pruning strategy 
to pre-select attacks with higher jailbreak potential. Hong et al. [13] instead optimise jointly for reward and novelty, incentivising the generation of both successful and previously unseen prompts in a curiosity-driven setting. Other lines of work focus on co-evolving attacker and defender policies to avoid static adversaries [17]. A central challenge in this setting is computational cost [14]. Xhonneux et al. [35] address this by operating in the continuous embedding space, reducing the expense associated with discrete token-level attacks. Liu et al. [22] introduce an online self-play framework in which a single model instantiates both attacker and defender, enabling mutual adaptation."},{"citing_arxiv_id":"2605.08277","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mitigating Many-shot Jailbreak Attacks with One Single Demonstration","primary_cat":"cs.CR","submitted_at":"2026-05-08T06:33:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"based jailbreaks span three regimes. Training-time defenses improve robustness by fine-tuning on adversarial data, but require parameter access and additional compute [19, 31, 33, 50, 8, 30]. Input-side defenses are training-free and black-box compatible, but often rely on heuristic filtering or prompt patches and may weaken under longer adversarial contexts [22, 49, 25, 37, 46]. Activation-level defenses [53] directly intervene on internal states but require white-box access. 
In contrast, our proposed SafeEnd uses a fixed one-shot safety intervention grounded in the fine-tuning-like drift view, requiring no training, parameter access, or per-instance prompt optimization. 3 Methodology This work aims to address the central question: why do MSJ attacks succeed, and how can they"},{"citing_arxiv_id":"2605.03378","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-05-05T05:37:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03095","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses","primary_cat":"cs.CR","submitted_at":"2026-05-04T19:17:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01899","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety 
Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"personas, and they lack a clear mechanistic explanation for their effectiveness. LLM safety alignment. Current safety alignment paradigms are largely built upon Reinforcement Learning from Human Feedback (RLHF) [3] and its efficient variants [21], such as Direct Preference Optimization (DPO) [22]. Defense mechanisms against jailbreak attacks include input preprocessing [23, 24, 25, 26, 27], output filtering [28, 29, 30], and robust prompt engineering [31]. Recently, adversarial self-play has emerged as a promising defense paradigm, where models iteratively discover and mitigate their own vulnerabilities. Representative approaches include SEAS [32], Self-RedTeam [33], STAIR [34], and MAGIC [35]. 
However, these methods primarily target instruction-level jailbreaks"},{"citing_arxiv_id":"2605.01078","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Sentence Relation-Based Approach to Sanitizing Malicious Instructions","primary_cat":"cs.CR","submitted_at":"2026-05-01T20:22:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00741","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems","primary_cat":"cs.CR","submitted_at":"2026-05-01T15:42:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00236","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Is Where You Attack","primary_cat":"cs.CR","submitted_at":"2026-04-30T21:15:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads 
barely affects refusals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21700","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers","primary_cat":"cs.CR","submitted_at":"2026-04-23T14:08:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20930","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-22T09:49:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19657","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"An AI Agent Execution Environment to Safeguard User Data","primary_cat":"cs.CR","submitted_at":"2026-04-21T16:45:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions 
deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CR]https://arxiv.org/abs/2509.25926 [26] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv (2023). arXiv:2309.00614 [cs.LG] https://arxiv.org/abs/2309.00614 [27] Mintong Kang, Zhaorun Chen, and Bo Li. 2025. C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails. In NeurIPS. NeurIPS. https://neurips.cc/virtual/2025/loc/san-diego/ 14 poster/116139 [28] Darya Kaviani, Alp Eren Ozdarendeli, Jinhao Zhu, Yu Ding, and Raluca Ada Popa. 2026. Opal: Private Memory for Personal AI. (2026). arXiv:2604."},{"citing_arxiv_id":"2604.18874","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Adversarial Environments Mislead Agentic AI?","primary_cat":"cs.AI","submitted_at":"2026-04-20T21:53:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17769","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped 
RLAIF","primary_cat":"cs.CL","submitted_at":"2026-04-20T03:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10326","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion","primary_cat":"cs.CR","submitted_at":"2026-04-11T19:19:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10134","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification","primary_cat":"cs.CR","submitted_at":"2026-04-11T09:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09544","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models Generate 
Harmful Content Using a Distinct, Unified Mechanism","primary_cat":"cs.CL","submitted_at":"2026-04-10T17:58:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06247","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SALLIE: Safeguarding Against Latent Language & Image Exploits","primary_cat":"cs.CR","submitted_at":"2026-04-06T16:29:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01473","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits","primary_cat":"cs.CR","submitted_at":"2026-04-01T23:29:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17368","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought 
Generation","primary_cat":"cs.AI","submitted_at":"2026-03-18T05:21:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02280","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RACC: Representation-Aware Coverage Criteria for LLM Safety Testing","primary_cat":"cs.SE","submitted_at":"2026-02-02T16:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23883","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","primary_cat":"cs.AI","submitted_at":"2025-10-27T21:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"strings before or after instructions to manipulate LLM agent behavior. A number of different attack strategies have been proposed for undertaking IPI. 
Originally proposed for jailbreaking, Greedy Coordinate Gradient (GCG) [64] has been adapted to IPI by generating affirmative prefixes containing adversarial strings that induce malicious outputs from the agent [65]. In a similar fashion, two-stage GCG [66] trains a two-part adversarial string that is still effective after paraphrasing in order to get beyond defenses based on paraphrase detection. Lastly, as the aforementioned attacks often generate gibberish strings that can be easily detected via perplexity defenses, AutoDAN [67] enhances the semantic quality of adversarial strings to decrease detectability."},{"citing_arxiv_id":"2510.20129","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models","primary_cat":"cs.CR","submitted_at":"2025-10-23T02:07:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.20325","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","primary_cat":"cs.CL","submitted_at":"2025-08-28T00:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.04204","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments","primary_cat":"cs.CL","submitted_at":"2025-08-06T08:35:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06414","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","primary_cat":"cs.CR","submitted_at":"2025-06-06T17:33:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.04390","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG","primary_cat":"cs.CR","submitted_at":"2025-06-04T19:15:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% 
success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02546","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems","primary_cat":"cs.CR","submitted_at":"2025-06-03T07:32:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01770","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction","primary_cat":"cs.CR","submitted_at":"2025-06-02T15:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19793","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prompt Injection Attack to Tool Selection in LLM Agents","primary_cat":"cs.CR","submitted_at":"2025-04-28T13:36:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box 
settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05206","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety","primary_cat":"cs.CR","submitted_at":"2025-02-02T05:14:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Attacks and Defenses for SAM (§ 2.2) (i) Adversarial Attacks: [47] [48] [49] [50] [51] [52] [53] [54] (ii) Adversarial Defenses: [55] (iii) Backdoor & Poisoning Attacks: [56] [57] Large Language Models (§ 3) Adversarial Attack (§ 3.1) (i) White-Box: [58] [59] [60] [61] [62] [63] [64] (ii) Black-Box: [65] [66] [67] Adversarial Defense (§ 3.2) (i) Adversarial Detection: [68] [69] (ii) Robust Inference: [70] Jailbreak Attacks (§ 3.3) (i) Black-Box: [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [81] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] (ii) White-Box: [99] [100] Jailbreak Defenses (§ 3.4) (i) Input Defense: [101] [102] [103] [104] [105] [106] [107] [108] [109] (ii) Output Defense: [110] [111] [112] [113] [114]"},{"citing_arxiv_id":"2411.15594","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such 
systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"involves the capacity to subsume under rules, that is, to distinguish whether something falls under a given rule. -- Kant,Critique of Judgment[59],Introduction IV, 5:179;Critique of Pure Reason[58],A132/B171. Recently, Large Language Models (LLMs) have achieved remarkable success across numerous domains [178], ranging from technical fields [142, 191, 210] to the humanities [55, 100, 113, 217] and social sciences [45, 127, 164, 177]. This growing interest stems from LLMs' ability to mimic human-like reasoning and thinking processes, enabling them to take on roles traditionally reserved for human experts while offering a cost-effective solution that can be effortlessly scaled to meet increasing evaluation demands. For instance, the use of LLM-as-a-Judge in academic peer review1"},{"citing_arxiv_id":"2410.15362","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models","primary_cat":"cs.LG","submitted_at":"2024-10-20T11:27:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Faster-GCG improves GCG efficiency 8x via regularization, temperature sampling, and duplicate avoidance, reaching 78.1% success rate with 32K evaluations across five aligned LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.04155","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toxic Subword Pruning for Dialogue Response Generation on Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-10-05T13:30:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToxPrune prunes toxic subwords from 
BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.18169","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\" and they have provided \"several suggestions on how to do so, noting that some of our takeaways may resonate for pre-deployment safety evaluations more broadly.\" However, we note that the claims in this paper [ 3] has become very controversial among the community. 12 Mechanism study towards continual learning and safety fine-tuning of large language models is done by Jain et al. [ 74, 75] and anonymous [2], which might provide useful analysis tools for harmful fine-tuning. We summarize the existing mechanism study for harmful fine-tuning in Table 4. Table 4: Summary of harmful fine-tuning mechanism study. Study Key Findings First Available Leong et al. 
[83]The attack mechanisms of Explicit Harmful Attack and Identity-Shifting Attack are different."},{"citing_arxiv_id":"2407.04295","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-07-05T06:57:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"level defense methods leave the prompts unchanged and fine-tune the language model to enhance the intrinsic safety guardrails so that the models decline to answer the harmful requests. 4.1 Prompt-level Defenses Prompt-level defenses refer to the scenarios where the direct access to neither the internal model weight nor the output 11 Jailbreak Defense Methods Prompt Level Prompt Detection [37] [1] Prompt Perturbation [11] [73] [38] [112] [45] [121] System Prompt Safeguard [77] [126] [94] [118] Model Level SFT-based [9] [18] [8] RLHF-based [66] [6] [83] [25] [59] [26] [58] Gradient and Logit Analysis [101] [102] [35] [53] Refinement [44] [113] Proxy Defense [110] [85] Figure 8: Taxonomy of jailbreak defense. 
logits is available, thus the prompt becomes the only vari-"},{"citing_arxiv_id":"2404.01318","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","primary_cat":"cs.CR","submitted_at":"2024-03-28T02:44:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06922","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Whispers in the Machine: Confidentiality in Agentic Systems","primary_cat":"cs.CR","submitted_at":"2024-02-10T11:07:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.08419","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Jailbreaking Black Box Large Language Models in Twenty Queries","primary_cat":"cs.LG","submitted_at":"2023-10-12T15:38:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 
queries.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We believe that this is largely attributable to the fact that PAIR's prompts are semantic, and they therefore target similar vulnerabilities across LLMs, which are generally trained on similar datasets. 4.3 Defended performance of PAIR. In Table 5, we evaluate the performance of PAIR against two jailbreaking defenses: SmoothLLM [20] and a perplexity filter [37, 38]. For SmoothLLM, we use N = 10 samples and a perturbation percentage of q = 10%; following [37], we set the threshold to be the maximum perplexity among the behaviors in JBB-Behaviors. Both defenses are evaluated statically, meaning that PAIR obtains prompts by attacking an undefended LLM, and then passes these prompts to a defended LLM. Notably, as shown in red, the JB% of PAIR drops significantly less than GCG when defended"}],"limit":50,"offset":0}