{"total":342,"items":[{"citing_arxiv_id":"2606.27632","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety","primary_cat":"cs.CL","submitted_at":"2026-06-26T01:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26566","ref_index":83,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-06-25T03:32:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26396","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization","primary_cat":"cs.LG","submitted_at":"2026-06-24T21:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10106","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05976","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Self-Correction Illusion: LLMs Correct Others but Not Themselves","primary_cat":"cs.AI","submitted_at":"2026-06-04T10:17:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02211","ref_index":93,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistency Training while Mitigating Obfuscation via Rate Matching","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:10:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00813","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-30T17:07:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Non-monotonic safety alignment appears in Gemma models, with Gemma 3 at 68.7% ASR versus 45.5% in Gemma 2 and 33.9% in Gemma 4 via MAP-Elites red-teaming and cross-generational attack transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00801","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety","primary_cat":"cs.CR","submitted_at":"2026-05-30T16:40:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00485","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs","primary_cat":"cs.CR","submitted_at":"2026-05-30T02:37:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30883","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking","primary_cat":"cs.CR","submitted_at":"2026-05-29T06:13:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRACE is a task-aware adaptive self-evolving jailbreaking framework that achieves up to 100% bypass rates on LLM agents via subtask decomposition and scenario evolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30693","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Triaging Threats to Specialized Guardrails","primary_cat":"cs.CR","submitted_at":"2026-05-29T00:36:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30454","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Surface You Test Is Not the Surface That Breaks","primary_cat":"cs.CR","submitted_at":"2026-05-28T18:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-surface baselines by 9.1 pp on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30189","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection","primary_cat":"cs.CR","submitted_at":"2026-05-28T16:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoRA adapters can be reliably backdoored through training-data poisoning with token-level generalization, and both behavioral probe statistics and weight-norm statistics separate poisoned from clean adapters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29708","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-28T10:09:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Safety enforcement in aligned MoE LLMs is localized to specific experts and can be altered independently of the model's topic-driven routing patterns via a new red-teaming method called RASET.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29659","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content","primary_cat":"cs.LG","submitted_at":"2026-05-28T09:21:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29629","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures","primary_cat":"cs.AI","submitted_at":"2026-05-28T09:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TLO is a logit-based diagnostic that visualizes temporal patterns of LLM jailbreak failures on a calibrated 2D plane, distinguishing attacks with identical ASR and enabling early stopping that reduces successful jailbreaks by more than half.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29396","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-28T05:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29251","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Provably Secure Agent Guardrail","primary_cat":"cs.AI","submitted_at":"2026-05-28T02:12:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29224","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-28T01:23:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28914","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AIRGuard: Guarding Agent Actions with Runtime Authority Control","primary_cat":"cs.CR","submitted_at":"2026-05-27T17:48:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28734","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests","primary_cat":"cs.CR","submitted_at":"2026-05-27T16:55:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Consolidates eight corpora into a 6,671-prompt bank with five-judge consensus labels separating executable malicious code requests (4,748) from harmful security knowledge requests (1,923), achieving Fleiss' kappa 0.767.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28664","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection","primary_cat":"cs.LG","submitted_at":"2026-05-27T15:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28639","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Attentional White Bear Effect in Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-27T15:45:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prohibited concepts remain recoverable from hidden states, influence attention routing, and shape generations in transformers under instruction-based suppression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28609","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-27T15:24:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JECA^2 is a new white-box attack method using Grad-CAM-guided perturbations and prompt embedding optimization to achieve judgment-explanation consistent adversarial attacks on forensic VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28030","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection","primary_cat":"cs.LG","submitted_at":"2026-05-27T06:36:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPARD defends LLMs from harmful fine-tuning attacks via alternating safety projections and relevance-diversity DPP data selection, reporting lowest attack success rates on GSM8K and OpenBookQA while keeping task accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27819","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions","primary_cat":"cs.LG","submitted_at":"2026-05-27T01:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27763","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-05-26T23:22:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27157","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-26T15:18:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAG models exhibit a monitoring-control gap: they acknowledge epistemic conflicts in accumulating documents yet fail to constrain unsafe recommendations, with single-turn tests overestimating safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26999","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals","primary_cat":"cs.CL","submitted_at":"2026-05-26T13:19:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Prompt injection detection performance is highly regime-dependent with no single detector dominating across settings; transformer models perform best overall while structural signals offer modest gains in some regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26526","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks","primary_cat":"cs.LG","submitted_at":"2026-05-26T04:18:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26409","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models","primary_cat":"cs.CR","submitted_at":"2026-05-26T00:36:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23196","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers","primary_cat":"cs.CR","submitted_at":"2026-05-22T03:27:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces Prompt Overflow Attack that fragments malicious instructions in overlength prompts to evade guardrail segmentation while remaining actionable to LLMs with larger context windows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":102,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22321","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions","primary_cat":"cs.CR","submitted_at":"2026-05-21T11:07:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A3S-Bench evaluates LLM agents against temporal, spatial, and semantic evasions, raising average risk trigger rates from 28.3% to 52.6% across 2,254 trajectories and 20 scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22258","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting","primary_cat":"cs.CL","submitted_at":"2026-05-21T10:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21948","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-21T03:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SCI-Defense combines perplexity detection, semantic integrity scoring across four manipulation dimensions, and inter-candidate detection to counter GEO attacks, reporting perfect precision on Amazon product data but domain-limited recall on web passages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21834","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation","primary_cat":"cs.LG","submitted_at":"2026-05-20T23:56:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21706","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent-space Attacks for Refusal Evasion in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-20T20:10:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21674","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adversarial Reframing: A Framework for Targeted Generation in Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-20T19:31:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21362","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T16:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20994","ref_index":117,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Context-Invariant Safety Alignment for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T10:33:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20759","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders","primary_cat":"cs.CR","submitted_at":"2026-05-20T05:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Graph-context LLM fraud defenders improve early refusal under replay and adaptive multi-round attacks compared to text baselines but increase benign over-refusal, with the cost localized to how the LLM consumes structured graph fields rather than encoder quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20654","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak","primary_cat":"cs.LG","submitted_at":"2026-05-20T03:16:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20641","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-20T02:55:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20519","ref_index":18,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Codec-Robust Attacks on Audio LLMs","primary_cat":"cs.SD","submitted_at":"2026-05-19T21:39:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20382","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:32:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20351","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)","primary_cat":"cs.CR","submitted_at":"2026-05-19T18:05:51+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19966","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:15:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19940","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains","primary_cat":"cs.AI","submitted_at":"2026-05-19T15:00:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19722","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Safety Alignment Effects in Autonomous Security Agents","primary_cat":"cs.CR","submitted_at":"2026-05-19T11:55:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}