ARMOR 2025 is a new benchmark with 519 doctrinally grounded prompts across a 12-category OODA-based taxonomy that reveals critical safety gaps in 21 commercial LLMs for military applications.
hub Mixed citations
Salad-bench: A hierarchical and com- prehensive safety benchmark for large language models
Mixed citation behavior. Most common role is background (40%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
citing papers explorer
-
ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
ARMOR 2025 is a new benchmark with 519 doctrinally grounded prompts across a 12-category OODA-based taxonomy that reveals critical safety gaps in 21 commercial LLMs for military applications.
-
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
-
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
-
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
-
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.
-
ShieldGemma: Generative AI Content Moderation Based on Gemma
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
- Self-Mined Hardness for Safety Fine-Tuning