{"total":17,"items":[{"citing_arxiv_id":"2605.21674","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adversarial Reframing: A Framework for Targeted Generation in Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-20T19:31:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16471","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI","primary_cat":"cs.CR","submitted_at":"2026-05-15T13:53:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"AudioJailbreak [22] achieves ≥87% ASR in universal strong-adversary settings. Encoding-based. Transforming requests into non-standard representations exploits weaker safety coverage outside typical natural language. CipherChat [147] reports near-100% bypass of GPT-4 safety via cipher encoding. Translation to low-resource languages increases bypass rates from <1% to 79% [145, 34]. 
ArtPrompt [65] uses ASCII art, and related work has shown that other non-standard representations such as Base64, ROT13, and Morse code similarly exploit weaker safety coverage in these encoding spaces. First Author et al.: Preprint submitted to Elsevier Page 9 of 25 Security and Safety Threats in Generative AI A cross-cutting finding from HarmBench is that robustness is shaped more by training data and algorithms than"},{"citing_arxiv_id":"2605.12705","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Early Data Exposure Improves Robustness to Subsequent Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:08:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10998","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-09T15:52:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24902","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Safety Drift After Fine-Tuning: Evidence from High-Stakes 
Domains","primary_cat":"cs.CY","submitted_at":"2026-04-27T18:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"cell assignments (7%) despite containing the threat classes with the longest detection latency by definition (T3/T4) and zero benchmark coverage for cross-session or weight-level attacks. TABLE V: Verifiable paper assignments for cells with ≤3 corpus papers (T3 and T4 columns only). All papers listed passed the PRISMA screening criteria in Section II. Cell Count Representative papers L1×T3 4 [34], [42]-[44] L1×T4 2 [34], [41] L2×T3 3 [32], [45], [46] L2×T4 3 [34], [41], [47] L3×T3 3 [39], [44], [48] L3×T4 2 [32], [44] L4×T3 3 [36], [39], [49] L4×T4 2 [45], [50] L5×T3 2 [12], [51] L5×T4 1 [52] L6×T3 1 [53] L6×T4 0 - L7×T3 1 [32] L7×T4 3 [34], [41], [46] V. 
L1-L2: FOUNDATION AND COGNITIVE LAYERS L1 and L2 share the same neural substrate but admit attacks"},{"citing_arxiv_id":"2604.17396","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Representation-Guided Parameter-Efficient LLM Unlearning","primary_cat":"cs.CL","submitted_at":"2026-04-19T11:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17215","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continual Safety Alignment via Gradient-Based Sample Selection","primary_cat":"cs.LG","submitted_at":"2026-04-19T02:52:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient-based selection that drops high-gradient samples during continual fine-tuning preserves safety alignment in LLMs better than standard fine-tuning while keeping task performance competitive.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07754","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training","primary_cat":"cs.CR","submitted_at":"2026-04-09T03:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense 
methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. CoRR abs/2308.13387, 2023. 3 [61] Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. CoRR abs/2407.16216, 2024. 11 [62] Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. CoRR abs/2310.02949, 2023. 11, 18 [63] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. CoRR abs/2309."},{"citing_arxiv_id":"2605.02914","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models","primary_cat":"cs.LG","submitted_at":"2026-04-08T05:27:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08813","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robust Policy Optimization to Prevent Catastrophic Forgetting","primary_cat":"cs.LG","submitted_at":"2026-02-09T15:50:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce 
catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05367","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs","primary_cat":"cs.CR","submitted_at":"2025-09-04T05:53:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.18169","ref_index":167,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and classified them into inherent issues, attacks, and unintended bugs. Then they gave a review on how Verification and Validation (V&V) techniques can be integrated to provide analysis to the safety of LLMs. Dong et al. [ 37] provides a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Yao et al. [167] study how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs. Chua et al. [ 27] provides an up-to-date survey of recent trends in AI safety research. 
[101] covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social"},{"citing_arxiv_id":"2409.00557","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Ask: When LLM Agents Meet Unclear Instruction","primary_cat":"cs.CL","submitted_at":"2024-08-31T23:06:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.04295","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-07-05T06:57:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"the new token to the suffix using the Single Token Optimization (STO) algorithm that considers both jailbreak and readability objectives. 
In this way, the optimized suffix is se- 3 Jailbreak Attack Methods White-box Attack Gradient-based [125] [42] [124] [93] [2] [29] [34] [82] [95] [62] Logits-based [116] [31] [23] [117] [36] [123] Fine-tuning-based [68] [103] [47] [111] Black-box Attack Tamplate Completion Scenario Nesting [52] [22] [104] Context-based [100] [20] [48] [5] [120] Code Injection [43] [61] Prompt Rewriting Cipher [108] [40] [33] [55] [55] [13] Low-resource Languages [21] [106] [49] Genetic Algorithm-based [56] [46] [107] [50] [88] LLM-based Generation [19] [109] [76] [12] [15] [41] [27] [91] [54] [64]"},{"citing_arxiv_id":"2406.11717","ref_index":202,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Refusal in Language Models Is Mediated by a Single Direction","primary_cat":"cs.LG","submitted_at":"2024-06-17T16:36:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.08144","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Agents can Autonomously Exploit One-day Vulnerabilities","primary_cat":"cs.CR","submitted_at":"2024-04-11T22:07:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}