Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
hub Mixed citations
arXiv preprint arXiv:2307.04657 , year =
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.
TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.
Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
-
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
-
Theoretical Limits of Language Model Alignment
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
-
The Safety-Aware Denoiser for Text Diffusion Models
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
-
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
-
Alignment Dynamics in LLM Fine-Tuning
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
-
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
-
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
ActivationReasoning: Logical Reasoning in Latent Activation Spaces
ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.
-
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
-
Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
-
Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.
-
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
-
WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
-
ShieldGemma: Generative AI Content Moderation Based on Gemma
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
-
Baichuan 2: Open Large-scale Language Models
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.