{"total":26,"items":[{"citing_arxiv_id":"2605.27766","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-26T23:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent social simulations show LLM privacy violations rising from 19.95% to 45.30%, with leakage spreading contagiously (8x after peer disclosure) and explicit instructions leaving rates above 37.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18988","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks","primary_cat":"cs.CR","submitted_at":"2026-05-18T18:06:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes the TRIAD framework that treats multi-turn multimodal attacks as continuous trajectories and uses structural anomaly detection, regularized Mahalanobis distance, topological acceleration, and a time-varying Cox model with Bayesian HMM feedback to predict and bound expected time-to-failure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18239","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual jailbreaking of LLMs using low-resource languages","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, 
rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15598","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-15T04:14:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14418","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Great Pretender: A Stochasticity Problem in LLM Jailbreak","primary_cat":"cs.CR","submitted_at":"2026-05-14T06:05:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09225","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring","primary_cat":"cs.CR","submitted_at":"2026-05-09T23:51:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 114k compositional jailbreak dataset is created, generators are 
fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04019","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours","primary_cat":"cs.AI","submitted_at":"2026-05-05T17:43:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom human code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02647","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming","primary_cat":"cs.CL","submitted_at":"2026-05-04T14:32:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"iteratively refine, mutate, or fuzz a single text artifact; even when that artifact embeds context or framing, it is ultimately submitted as one contiguous input. This excludes attacks in which a conversational trajectory progressively shapes the model's interpretive frame before the harmful request is issued. 
This limitation is visible in our experiments: the hand-designed multi-turn Crescendo attack [25], which uses no in-loop search, reaches 56% ASR@4 on the strongest open-source target we study, gpt-oss:120B. In contrast, hand-crafted human jailbreaks achieve only 8% ASR@4 and 6% ASR@5 on the same model, while the Encoding baseline, which relies on surface-level obfuscation, remains below 2.5% ASR@4. The insight motivating this work is that contextual priming, the"},{"citing_arxiv_id":"2604.27861","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning","primary_cat":"cs.CR","submitted_at":"2026-04-30T13:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to bypass safety alignment within a single prompt [32]. However, as safety filters have evolved to detect explicit malicious patterns, the threat landscape has shifted towards more sophisticated multi-turn strategies that exploit the model's context-following capabilities and the stateless nature of standard defenses. A pivotal precursor is Crescendo [19], which engages the model in a multi-turn conversation that begins with benign topics and imperceptibly escalates toward prohibited content. Although not strictly a semantic fragmentation attack, it bypasses guardrails that evaluate each turn in isolation by exploiting the cumulative toxicity of the dialogue history. 
Building upon this exploitation of context, recent research has"},{"citing_arxiv_id":"2604.21131","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms","primary_cat":"cs.CR","submitted_at":"2026-04-22T22:40:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection and serving stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11309","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-04-13T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"turn attacks that rely on a single crafted prompt, multi-turn jailbreak attacks unfold iteratively, mimicking real-world human conversations, making them more covert and persistent and thus exposing more critical security vulnerabilities of LLMs. Recent surges in multi-turn jailbreaking attacks have given rise to increasingly sophisticated strategies that exploit the contextual dynamics of LLMs [9], [10], [11], [12], [13], [14], [15]; these attacks are typically more effective than single-turn approaches and have led to numerous variations. 
Motivation. Existing multi-turn jailbreak methods typically decompose a malicious query into several seemingly safe steps, yet they overlook a non-trivial detail: the malicious request must eventually be triggered in the final turn."},{"citing_arxiv_id":"2604.07727","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense","primary_cat":"cs.CR","submitted_at":"2026-04-09T02:22:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04759","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw","primary_cat":"cs.CR","submitted_at":"2026-04-06T15:27:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04060","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-05T11:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoopGuard deploys cooperative agents to 
track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02652","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generalization Limits of Reinforcement Learning Alignment","primary_cat":"cs.LG","submitted_at":"2026-04-03T02:32:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compound jailbreaks raise attack success on aligned LLMs from 14.3% to 71.4%, providing evidence that safety training generalizes less broadly than model capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21354","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project","primary_cat":"cs.LG","submitted_at":"2026-03-22T18:30:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rule applied to conversation history. The enforcement mechanism is the same async_pre_call_hook used for tool catalog shaping: the router strips flagged turns from the messages array before forwarding to the new model. Feasibility and risk. SafetyL1 already produces continuous confidence scores; computing the EMA is trivial. 
The threat model is grounded in Crescendo attacks [70], which achieve 29-61% higher jailbreak rates than single-turn methods by gradually escalating across benign-looking turns. Recent proxy-level defenses validate the cumulative-scoring approach: a peak+accumulation formula combining single-turn peak risk, persistence ratio, and category diversity achieves 90.8% recall at 1.20% false-positive rate on 10,654 multi-turn conversations [71];"},{"citing_arxiv_id":"2512.22753","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software","primary_cat":"cs.SE","submitted_at":"2025-12-28T02:55:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12069","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring","primary_cat":"cs.CR","submitted_at":"2025-12-12T22:31:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10100","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Robust AI Security and Alignment: A 
Sisyphean Endeavor?","primary_cat":"cs.AI","submitted_at":"2025-12-10T21:44:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.12710","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs","primary_cat":"cs.CL","submitted_at":"2025-11-16T17:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"by benchmarking EvoSynth against a diverse suite of 11 leading methods. Our selection spans foundational techniques, including optimization-based (PAIR [5], AutoDAN [24]) and search-based (Tree of Attacks [31]) approaches. We also include a comprehensive set of recent multi-turn and agent-based frameworks: ActorAttack [40], Chain of Attack (CoA) [54], Crescendo [41], RACE [55], AutoRedTeamer [66], AutoDan-Turbo [25], RainbowTeaming [42], and X-Teaming [39]. The baselines also employ specialized methods such as CodeAttack [17] and RedQueen [19], thereby ensuring a challenging benchmark. 
A Note on Fair Comparison. Acknowledging that performance can be conflated with computational budget, we standardized the maximum resources available to each frame-"},{"citing_arxiv_id":"2511.02356","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs","primary_cat":"cs.CR","submitted_at":"2025-11-04T08:24:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.10546","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain","primary_cat":"cs.CL","submitted_at":"2025-09-07T22:35:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05367","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in 
LLMs","primary_cat":"cs.CR","submitted_at":"2025-09-04T05:53:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.00555","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Activation-Guided Local Editing for Jailbreaking Attacks","primary_cat":"cs.CR","submitted_at":"2025-08-01T11:52:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06414","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","primary_cat":"cs.CR","submitted_at":"2025-06-06T17:33:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.04468","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks","primary_cat":"cs.AI","submitted_at":"2024-11-07T06:36:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Magentic-One is a modular multi-agent 
system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}