{"total":12,"items":[{"citing_arxiv_id":"2606.28153","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-06-26T14:51:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29659","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content","primary_cat":"cs.LG","submitted_at":"2026-05-28T09:21:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20626","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Safety Benchmarking via Item Response Theory","primary_cat":"cs.CY","submitted_at":"2026-05-26T17:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full 
rankings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22435","ref_index":127,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation","primary_cat":"cs.CL","submitted_at":"2026-05-21T13:02:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10639","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Following Human; ToxiGen [9] 2022 Implicit Toxicity, Hate Speech Text Generation, Classification HateBERT [1], ToxDectRoBERTa [36]; AdvBench [37] 2023 Adversarial Robustness Text Generation, Instr. Following Automated Metrics; DoNotAnswer [30] 2023 Harmfulness QA, Instr. Following Human, LLM-as-a-Judge; MaliciousInstruct [11] 2023 Jailbreak Robustness Instruction Following Automated Classifier; SafetyBench [34] 2023 Comprehensive Safety Multiple-Choice QA Accuracy; SimpleSafetyTests [29] 2023 Harmfulness QA, Instr. 
Following Human, Automated Classifiers; ToxicChat [17] 2023 Toxicity Toxicity Classification Classification Metrics; XSTest [24] 2023 Over-refusal QA, Instr. Following Human, Rule-based, LLM-as-a-Judge; BeHonest [3] 2024 Honesty, Misinformation QA Rule-based, LLM-as-a-Judge; HarmBench [20] 2024 Red Teaming, Robust Refusal Instruction Following LLM-as-a-Judge; JailbreakBench [2] 2024 Jailbreak Robustness Instruction Following Rule-based, LLM-as-a-Judge; SALAD-Bench [14] 2024 Comprehensive Safety QA, Instr."},{"citing_arxiv_id":"2605.07982","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLiGuard: Schema-Conditioned Classification for LLM Safeguard","primary_cat":"cs.CL","submitted_at":"2026-05-08T16:44:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 Label Embedding Extraction After encoding, we extract the hidden states at the positions of the [L] tokens, which serve as the contextualized label representations used for classification. For task k with $M_k$ labels, we obtain: $e^{(L)}_{k,i} = h_{j_{L_i}}, \\quad i = 1, \\dots, M_k$ (6) where $j_{L_i}$ is the position of the i-th [L] token for task k. This yields the label embedding matrix $E_k = [e^{(L)}_{k,1}, \\dots, e^{(L)}_{k,M_k}] \\in \\mathbb{R}^{M_k \\times d}$. Because each [L] token is processed under full bidirectional attention jointly with the entire input, its hidden state is not a static token embedding: it is informed by all other labels in the task and the complete input text, yielding rich context-aware label representations (Figure 3, step 3). 
3.4 Classification Head The classification head operates on the label embeddings e(L)"},{"citing_arxiv_id":"2605.06605","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:25:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[33] Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N Angelopoulos, Asaf Gendler, and Yaniv Romano. Label noise robustness of conformal prediction. Journal of Machine Learning Research, 25(328):1-66, 2024. [34] Coby Penso and Jacob Goldberger. A conformal prediction score that is robust to label noise. arXiv preprint arXiv:2405.02648, 2024. [35] Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A Hale, and Paul Röttger. SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models. arXiv preprint arXiv:2311.08370, 2023. [36] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. 
Nemo guardrails: A toolkit for controllable and safe llm applications with programmable"},{"citing_arxiv_id":"2605.05678","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:12:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and remove near-duplicates with MinHash-LSH using token-level Jaccard similarity [39]. After filtering and deduplication, we split the pool into a 41K in-distribution diagnostic/centroid-construction pool and a 2K held-out test set using a source-stratified split. For robustness evaluation, we construct a separate out-of-distribution (OOD) set from AdvBench [13], SaladBench [18], SimpleSafetyTests [17], and WildJailbreak [40]. These prompts are processed with the same filtering and deduplication pipeline. Details are provided in Appendix D. General-ability benchmarks. For capability retention, we evaluate steered models on BBH [41], GSM8K [42], and MMLU [43]. 
These benchmarks are used only for evaluation, not for selecting steering directions or tuning safety thresholds."},{"citing_arxiv_id":"2604.18519","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Safety From Within: Detecting Harmful Content with Internal Representations","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:17:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"MLP Classifier Training. The MLP classifier on top of aggregated safety neurons is optimized via Optuna (Akiba et al., 2019) with cross-validation. We search: the number of hidden layers, hidden dimensions, dropout rates, and learning rate. Each trial trains with early stopping; the final model uses the best hyperparameters identified via cross-validation and trains until convergence. Table 6: Key hyperparameters for SIREN training. Ranges indicate search spaces. Hyperparameter Value/Range; Probe L1 regularization C [100, 1000]; Neuron threshold η [0.6, 0.9]; MLP hidden layers [2, 3]; MLP hidden dimensions [64, 2048]; MLP dropout [0.2, 0.5]; Optuna trials 32; Cross-validation folds 3. 
A.2 Hyperparameter Selection We provide empirically effective hyperparameter"},{"citing_arxiv_id":"2604.07655","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-08T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"50000; Jailbreak in-the-wild-jailbreak-prompts [72] 1558; Jailbreak trustgen [26] 596; Privacy TrustGen-Privacy [26] 4036; Robustness bbh [69] 500; Robustness cnn_dailymail [64] 1000; Robustness commonsense_qa [70] 500; Robustness mmlu [24] 1000; Robustness mnli [81] 1000; Robustness qnli [74] 500; Robustness sst2 [67] 500; Robustness trivia_qa [37] 1000; Robustness truthful_qa [41] 200; Robustness ultrachat [15] 3000; Toxicity FredZhang7-toxi-text-3M [88] 10000; Toxicity JBB-Behaviors [8] 100; Toxicity PKU-SafeRLHF-QA [71] 5827; Toxicity StrongReject [68] 313"},{"citing_arxiv_id":"2604.06233","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules","primary_cat":"cs.AI","submitted_at":"2026-04-03T13:53:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models refuse 75.4% of requests to evade defeated rules and do so even after 
recognizing reasons that undermine the rule's legitimacy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.18495","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs","primary_cat":"cs.CL","submitted_at":"2024-06-26T16:58:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}