Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, J Zico Kolter, Matt Fredrikson, Milad Nasr, Nicholas Carlini, Zifan Wang · 2023 · cs.CL · arXiv 2307.15043

Mixed citation behavior. Most common role is background (65%).

438 Pith papers citing it

Background 65% of classified citations

abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

citation-role summary

background 37 · dataset 6 · method 5 · baseline 2 · other 2

citation-polarity summary

background 34 · use dataset 6 · unclear 4 · use method 4 · baseline 2 · support 2
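
The page does not say how the headline "65%" background share is computed. The following is a minimal sketch, assuming a share is simply a label's count divided by the total of classified citations in that summary; the function name `shares` and the rounding to whole percentages are illustrative assumptions, not the hub's documented method.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"background": 37, "dataset": 6, "method": 5, "baseline": 2, "other": 2}
polarity_counts = {
    "background": 34,
    "use dataset": 6,
    "unclear": 4,
    "use method": 4,
    "baseline": 2,
    "support": 2,
}


def shares(counts):
    """Return each label's share of the total, as a whole percentage (assumed formula)."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Each line prints the summary's total count and its background share in percent.
print(sum(role_counts.values()), role_shares["background"])
print(sum(polarity_counts.values()), polarity_shares["background"])
```

Under this assumption both summaries cover 52 classified citations, with background at about 71% by role and about 65% by polarity, so the headline figure seems to follow the polarity counts.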

authors

Andy Zou, J Zico Kolter, Matt Fredrikson, Milad Nasr, Nicholas Carlini, Zifan Wang

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

cs.CR · 2026-05-24 · unverdicted · novelty 8.0

MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

cs.CR · 2026-04-03 · accept · novelty 8.0

Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

cs.CL · 2026-03-17 · conditional · novelty 8.0

Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.

A First Look at the Security Issues in the Model Context Protocol Ecosystem

cs.CR · 2025-10-18 · conditional · novelty 8.0

Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

cs.CR · 2025-09-08 · unverdicted · novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

cs.CL · 2023-08-02 · conditional · novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.

citing papers explorer

Showing 50 of 73 citing papers after filters.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies cs.LG · 2026-06-28 · unverdicted · none · ref 10 · internal anchor
SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers cs.LG · 2026-06-10 · unverdicted · none · ref 43 · 2 links · internal anchor
An online KS-statistic monitor detects shifts in deployed safety classifiers with 86.6% valid detection rate, exposes conformal prediction collapse in high-dimensional embeddings, and derives a confidence-gated security boundary against adaptive attackers.
Gate AI: LLM Security Benchmark Evaluation Methodology and Results cs.LG · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
Introduces a cross-validation-based evaluation methodology for LLM security detectors using a global threshold and group-fold leakage checks to avoid per-dataset tuning.
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions cs.LG · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes cs.LG · 2026-05-19 · conditional · none · ref 3 · internal anchor
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
Widening the Gap: Exploiting LLM Quantization via Outlier Injection cs.LG · 2026-05-14 · conditional · none · ref 45 · internal anchor
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation cs.LG · 2026-05-07 · unverdicted · none · ref 37 · internal anchor
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
On the Hardness of Junking LLMs cs.LG · 2026-05-06 · unverdicted · none · ref 65 · internal anchor
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 85 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models cs.LG · 2026-04-20 · unverdicted · none · ref 44 · internal anchor
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory cs.LG · 2026-04-14 · unverdicted · none · ref 27 · internal anchor
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 208 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Scalable Extraction of Training Data from (Production) Language Models cs.LG · 2023-11-28 · conditional · none · ref 56 · internal anchor
Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.
Addressing Over-Refusal in LLMs with Competing Rewards cs.LG · 2026-06-30 · unverdicted · none · ref 31 · internal anchor
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
Quality Is Not a Safety Proxy Under Quantization cs.LG · 2026-06-08 · conditional · none · ref 45 · internal anchor
Across 51 quantized checkpoints, quality metrics fail to predict safety drops in 36 pairings and 10 hidden-danger cases, while a new RTSI screen routes all 10 dangerous rows to testing at matched bucket size.
When Behavioral Safety Evaluation Fails: A Representation-Level Perspective cs.LG · 2026-06-06 · unverdicted · none · ref 21 · internal anchor
Behavioral safety metrics for LLMs are insufficient because models can maintain safe outputs while remaining vulnerable to latent-space interventions, as shown via dissociated models and the new Latent Vulnerability Score.
Sequential Data Poisoning in LLM Post-Training cs.LG · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
Multiple adversaries poisoning different stages of LLM post-training produce additive or complementary effects that single-stage analyses underestimate.
FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses cs.LG · 2026-06-03 · unverdicted · none · ref 13 · internal anchor
FailureScope clusters evaluation probes by cross-model failure patterns via LOMO to produce stable taxonomies that generalize across single-turn, multi-turn, and adversarial regimes, with reported metrics of Kendall's tau 0.81 and AUC 0.88.
Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection cs.LG · 2026-05-27 · unverdicted · none · ref 59 · internal anchor
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving cs.LG · 2026-05-26 · unverdicted · none · ref 28 · internal anchor
The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.
Curriculum Learning for Safety Alignment cs.LG · 2026-05-25 · unverdicted · none · ref 29 · internal anchor
Staged-Competence curriculum reduces out-of-distribution harmful responses by 16% and jailbreak success rates by 20% in DPO safety alignment across three model families while using 75% of the data.
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation cs.LG · 2026-05-20 · conditional · none · ref 37 · internal anchor
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
The Evaluation Game: Beyond Static LLM Benchmarking cs.LG · 2026-05-19 · unverdicted · none · ref 60 · internal anchor
Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.
Alignment Dynamics in LLM Fine-Tuning cs.LG · 2026-05-18 · unverdicted · none · ref 43 · internal anchor
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing cs.LG · 2026-05-15 · unverdicted · none · ref 240 · internal anchor
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 33 · internal anchor
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation cs.LG · 2026-05-14 · conditional · none · ref 42 · internal anchor
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 210 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Before the Last Token: Diagnosing Final-Token Safety Probe Failures cs.LG · 2026-05-12 · unverdicted · none · ref 16 · internal anchor
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation cs.LG · 2026-05-12 · unverdicted · none · ref 33 · internal anchor
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
Leveraging RAG for Training-Free Alignment of LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 72 · internal anchor
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 21 · 2 links · internal anchor
OTora is a two-stage framework that generates insertion-aware adversarial triggers and ICL-guided genetic payloads to induce reasoning-level denial-of-service in tool-augmented LLM agents across multiple backbones while preserving task correctness.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
Self-Mined Hardness for Safety Fine-Tuning cs.LG · 2026-05-04 · unverdicted · none · ref 5 · 2 links · internal anchor
Self-mined hardness from model rollouts lowers WildJailbreak attack success to 1-3% on Llama-3 models while raising over-refusal, mitigated by 1:1 interleaving with benign prompts.
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs cs.LG · 2026-05-03 · unverdicted · none · ref 16 · internal anchor
RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance cs.LG · 2026-05-01 · unverdicted · none · ref 25 · internal anchor
Stable-GFlowNet stabilizes GFN training for LLM red-teaming by eliminating Z estimation via pairwise comparisons and robust masking against noisy rewards while adding a fluency stabilizer.
Diversity in Large Language Models under Supervised Fine-Tuning cs.LG · 2026-04-30 · unverdicted · none · ref 67 · 2 links · internal anchor
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Towards Understanding the Robustness of Sparse Autoencoders cs.LG · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs cs.LG · 2026-04-12 · unverdicted · none · ref 37 · internal anchor
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance cs.LG · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
Spectral geometry of LoRA adapters encodes training objective and predicts harmful compliance in language models.
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal cs.LG · 2026-04-09 · unverdicted · none · ref 52 · internal anchor
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond cs.LG · 2026-03-02 · conditional · none · ref 26 · internal anchor
SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
BarrierSteer: LLM Safety via Learning Barrier Steering cs.LG · 2026-02-23 · unverdicted · none · ref 27 · internal anchor
BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment cs.LG · 2026-02-05 · unverdicted · none · ref 19 · internal anchor
f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs cs.LG · 2025-12-11 · unverdicted · none · ref 19 · internal anchor
Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
Graph-Regularized Sparse Autoencoders for LLM Safety Steering cs.LG · 2025-12-07 · unverdicted · none · ref 21 · internal anchor
GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming standard SAEs and baselines on Llama-3 and other models.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 57 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems cs.LG · 2025-06-11 · unverdicted · none · ref 80 · internal anchor
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
