Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, J Zico Kolter, Matt Fredrikson, Milad Nasr, Nicholas Carlini, Zifan Wang · 2023 · cs.CL · arXiv 2307.15043

Mixed citation behavior. Most common role is background (65%).

398 Pith papers citing it

Background 65% of classified citations

abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

citation-role summary

background 37 · dataset 6 · method 5 · baseline 2 · other 2

citation-polarity summary

background 34 · use dataset 6 · unclear 4 · use method 4 · baseline 2 · support 2
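
The page does not say how its headline percentage is derived from these counts. The sketch below is one plausible reading, assuming each share is simply a label's count divided by the sum of the counts listed in that summary. The `shares` helper and the dictionary names are illustrative and not part of the hub's tooling.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified citations")
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {
    "background": 37, "dataset": 6, "method": 5, "baseline": 2, "other": 2,
}
polarity_counts = {
    "background": 34, "use dataset": 6, "unclear": 4,
    "use method": 4, "baseline": 2, "support": 2,
}

# Assumption: share = count / sum of the counts shown in that summary.
role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(role_shares["background"])      # 71.2
print(polarity_shares["background"])  # 65.4
```

Both summaries total 52 classified citations. Under this assumption the polarity count for background (34 of 52) reproduces the 65% headline figure, while the role count (37 of 52) would give roughly 71%.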

authors

Andy Zou, J Zico Kolter, Matt Fredrikson, Milad Nasr, Nicholas Carlini, Zifan Wang

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

cs.CR · 2026-05-24 · unverdicted · novelty 8.0

MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

cs.CR · 2026-04-03 · accept · novelty 8.0

Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

cs.CL · 2026-03-17 · conditional · novelty 8.0

Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.

A First Look at the Security Issues in the Model Context Protocol Ecosystem

cs.CR · 2025-10-18 · conditional · novelty 8.0

Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

cs.CR · 2025-09-08 · unverdicted · novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

cs.CL · 2023-08-02 · conditional · novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.

citing papers explorer

Showing 50 of 66 citing papers after filters.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models cs.CL · 2026-03-17 · conditional · none · ref 10 · internal anchor
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models cs.CL · 2023-08-02 · conditional · none · ref 5 · internal anchor
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models cs.CL · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 102 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 102 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs cs.CL · 2026-05-13 · conditional · none · ref 37 · internal anchor
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 24 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming cs.CL · 2026-04-21 · unverdicted · none · ref 78 · internal anchor
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 76 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints cs.CL · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
Forced-choice MCQs with only unsafe options bypass LLM safety refusals that work on equivalent open-ended prompts, with violation rates rising sharply under intermediate constraints and near saturation for model-generated MCQs.
EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading cs.CL · 2025-12-23 · unverdicted · none · ref 9 · internal anchor
A concept bottleneck model makes automated essay scoring transparent by evaluating eight rubric-aligned writing concepts before assigning grades.
Improving LLM Unlearning Robustness via Random Perturbations cs.CL · 2025-01-31 · unverdicted · none · ref 39 · internal anchor
LLM unlearning is reframed as inadvertently installing backdoor triggers on forget-tokens; Random Noise Augmentation is introduced as a defense that improves robustness with theoretical guarantees.
VoiceBench: Benchmarking LLM-Based Voice Assistants cs.CL · 2024-10-22 · unverdicted · none · ref 111 · internal anchor
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation cs.CL · 2023-10-10 · conditional · none · ref 24 · internal anchor
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming cs.CL · 2026-05-04 · unverdicted · none · ref 39
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG cs.CL · 2026-06-30 · unverdicted · none · ref 50 · internal anchor
The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
Expert-Aware Refusal Steering cs.CL · 2026-06-02 · unverdicted · none · ref 38 · internal anchor
Refusal steering works on MoE LLMs; expert-aware variants succeed with single-expert outputs and refusal signals differ from routing patterns.
Consistency Training while Mitigating Obfuscation via Rate Matching cs.CL · 2026-06-01 · unverdicted · none · ref 93 · internal anchor
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 26 · internal anchor
SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.
Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs cs.CL · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
Safety enforcement in aligned MoE LLMs is localized to specific experts and can be altered independently of the model's topic-driven routing patterns via a new red-teaming method called RASET.
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents cs.CL · 2026-05-28 · unverdicted · none · ref 38 · internal anchor
Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.
The Attentional White Bear Effect in Transformer Language Models cs.CL · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
Prohibited concepts remain recoverable from hidden states, influence attention routing, and shape generations in transformers under instruction-based suppression.
Towards Context-Invariant Safety Alignment for Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 117 · internal anchor
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs cs.CL · 2026-05-19 · unverdicted · none · ref 24 · 2 links · internal anchor
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models cs.CL · 2026-05-13 · conditional · none · ref 37 · internal anchor
Step-wise detection via a contrastive safety direction followed by remasking and adaptive steering reduces jailbreak success rates in diffusion language models to 0.64% while preserving output quality.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 7 · 2 links · internal anchor
REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.
LLM-Agnostic Semantic Representation Attack cs.CL · 2026-05-09 · unverdicted · none · ref 14 · internal anchor
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue cs.CL · 2026-05-07 · unverdicted · none · ref 46 · 2 links · internal anchor
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs cs.CL · 2026-05-06 · unverdicted · none · ref 8 · 2 links · internal anchor
LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety cs.CL · 2026-05-03 · unverdicted · none · ref 16 · internal anchor
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
A Theoretical Game of Attacks via Compositional Skills cs.CL · 2026-05-01 · unverdicted · none · ref 18 · internal anchor
A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stronger empirical performance.
Test-Time Safety Alignment cs.CL · 2026-04-28 · unverdicted · none · ref 52 · internal anchor
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model cs.CL · 2026-04-28 · unverdicted · none · ref 8 · 2 links · internal anchor
Paired prompt-response analysis shows 61% of LLM responses reduce harm severity, 36% preserve it, and 3% escalate, with Sexual content showing highest persistence and LLM graders exhibiting detection asymmetry.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 30 · internal anchor
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Representation-Guided Parameter-Efficient LLM Unlearning cs.CL · 2026-04-19 · unverdicted · none · ref 80 · internal anchor
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism cs.CL · 2026-04-10 · unverdicted · none · ref 27 · internal anchor
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies cs.CL · 2026-04-10 · unverdicted · none · ref 8 · internal anchor
LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.
Exclusive Unlearning cs.CL · 2026-04-07 · unverdicted · none · ref 20 · internal anchor
Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.
DRIV-EX: Counterfactual Explanations for Driving LLMs cs.CL · 2026-02-28 · unverdicted · none · ref 13 · internal anchor
DRIV-EX generates fluent counterfactual scene descriptions by using gradient-optimized embeddings only as a guide for controlled text decoding, producing more reliable explanations than baselines on transcribed highD driving data.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs cs.CL · 2025-11-16 · unverdicted · none · ref 70 · internal anchor
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking cs.CL · 2025-10-11 · conditional · none · ref 5 · internal anchor
ADMIT achieves 86% average attack success rate on RAG fact-checking at 0.93×10^{-6} poisoning rate across 4 retrievers, 11 LLMs, and 4 benchmarks while remaining robust to counter-evidence.
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts cs.CL · 2025-10-08 · unverdicted · none · ref 3 · internal anchor
Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models cs.CL · 2025-10-04 · unverdicted · none · ref 16 · internal anchor
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain cs.CL · 2025-09-07 · unverdicted · none · ref 44 · internal anchor
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal cs.CL · 2025-09-07 · unverdicted · none · ref 38 · internal anchor
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments cs.CL · 2025-08-06 · unverdicted · none · ref 34 · internal anchor
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs cs.CL · 2025-05-22 · unverdicted · none · ref 46 · internal anchor
Machine unlearning in LLMs is often reversible via fine-tuning, indicating suppression not deletion, and a new representation-level framework identifies four forgetting regimes based on reversibility and catastrophicity.
Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs cs.CL · 2025-05-20 · unverdicted · none · ref 29 · internal anchor
Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.
