hub Mixed citations

arXiv:2406.14598 (2025)

Xie, Z · 2025 · arXiv 2406.14598

Mixed citation behavior. Most common role is background (60%).

28 Pith papers citing it

Background 60% of classified citations

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 3 · unclear: 1 · use method: 1

representative citing papers

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

cs.AI · 2026-06-08 · conditional · novelty 7.0

A reliable-to-expressive curriculum with dynamic rubrics trains a 12B safety judge to achieve 94%+ accuracy with only 0.76 cross-rubric variance on three different rubric prompts.

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

cs.CL · 2026-05-13 · conditional · novelty 7.0

LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

cs.CR · 2026-05-06 · conditional · novelty 7.0

TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

Fine-tuned ModernBERT-family encoders match LLM judges on F1, false negative rate, and precision-recall for harmful output detection across adversarial datasets and attack types while promising lower cost and latency.

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

Safety-aligned LLMs treat benign and harmful compliance demonstrations differently in in-context learning, with preference optimization preventing benign examples from increasing harmful compliance and strong recency bias in ordering.

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.

Efficient Safety Benchmarking via Item Response Theory

cs.CY · 2026-05-26 · unverdicted · novelty 6.0

Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full rankings.

Before the Last Token: Diagnosing Final-Token Safety Probe Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

cs.CR · 2026-05-09 · unverdicted · novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

cs.CR · 2026-05-04 · accept · novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

Reasoning Structure Matters for Safety Alignment of Reasoning Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

cs.AI · 2026-06-23 · unverdicted · novelty 5.0

PHANTOM is a consolidated open-source dataset of 47,524 multimodal adversarial samples for VLMs, extending prior benchmarks across 10 high-level categories and 55 subcategories of harmful intents.

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

cs.CR · 2026-05-04 · accept · novelty 5.0

The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

Discriminatory Compliance: How LLMs Answer Queries from Protected Groups

cs.CY · 2026-06-19 · unverdicted · novelty 4.0

State-of-the-art LLMs respond inconsistently to queries from protected-group personas, with some responses omitting key information that should be provided.
