{"total":58,"items":[{"citing_arxiv_id":"2606.28770","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanistic Personality Analysis of LLMs Steering Personality via Latent Feature Interventions","primary_cat":"cs.AI","submitted_at":"2026-06-27T06:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Applies sparse autoencoders to locate and steer latent features for OCEAN personality traits in LLMs while preserving benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22686","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs","primary_cat":"cs.CR","submitted_at":"2026-06-21T22:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22211","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Open AI in the Wild: Adoption and Adaptation of Open Models on r/LocalLLaMA","primary_cat":"cs.HC","submitted_at":"2026-06-20T20:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Thematic analysis of r/LocalLLaMA discussions finds users define openness via reliability, local control, privacy, and adaptation under compute, licensing, and usability 
constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06333","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability","primary_cat":"cs.LG","submitted_at":"2026-06-04T16:08:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03486","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense","primary_cat":"cs.CR","submitted_at":"2026-06-02T11:01:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"NeuroArmor uses safe-variant-guided representation consistency checks for selective intervention, reducing jailbreak ASR from 41.56% to 1.57% and benign FPR from 30.26% to 22.05% on Llama-3-8B-Instruct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01196","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Low-Resource Safety Failures Are Action Failures, Not Representation Failures","primary_cat":"cs.CL","submitted_at":"2026-05-31T12:19:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Low-resource safety failures are action 
failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01060","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-31T07:05:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00545","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition","primary_cat":"cs.LG","submitted_at":"2026-05-30T05:33:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"On Llama-3.1-70B-Instruct the Assistant persona functions as the sole canonical reference for cross-persona authorship judgments, with symmetric entropy gaps predicting only on its row and asymmetric surprise relative to the Assistant predicting off its row.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30162","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse 
Autoencoders","primary_cat":"cs.AI","submitted_at":"2026-05-28T16:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"No tested model showed robust format-independent refusal on biosecurity hazards; a new divergence score between behavioral labels and SAE activations separated responses in one preliminary case.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28467","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training","primary_cat":"cs.LG","submitted_at":"2026-05-27T13:33:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27914","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? 
A Self-Evolving, Trust-by-Construction Evaluation Paradigm","primary_cat":"cs.CL","submitted_at":"2026-05-27T03:41:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27763","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-05-26T23:22:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a paired testing protocol for batch-conditioned refusal robustness in LLM serving and reports low rates of genuine safety-label flips after adjudication, with a batch-invariant kernel ablation eliminating observed flips.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25510","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-25T07:14:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 
6-24%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24856","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth","primary_cat":"cs.LG","submitted_at":"2026-05-24T04:25:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces CAZ framework using Separation, Coherence, and Velocity metrics to identify depth regions of concept allocation, with empirical tests across 34 models showing multimodal separation curves and causally active gentle CAZes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24552","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling","primary_cat":"cs.CR","submitted_at":"2026-05-23T12:39:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24279","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions","primary_cat":"cs.CL","submitted_at":"2026-05-22T23:13:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ContextEcho benchmark shows persona drift occurs across 23 frontier models in long 
agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22462","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-21T13:25:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A five-stage causal feature analysis methodology is proposed and tested on GPT-2 for IOI, showing partial causality of SAE features, robustness differences under shifts, and deployment cost benefits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20262","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing","primary_cat":"cs.LG","submitted_at":"2026-05-18T18:17:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18918","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection 
Defense","primary_cat":"cs.CR","submitted_at":"2026-05-18T06:57:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESLD extracts safety signals directly from the latent space of any guard model to enable faster and more accurate prompt-injection detection without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17413","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications","primary_cat":"cs.CR","submitted_at":"2026-05-17T12:18:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17231","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-17T03:00:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17173","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Do Safety Guardrails Degrade Across 
Languages?","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:08:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent variable IRT framework decouples four safety-driving factors across 61 model configurations and 10 languages using 1.9 million evaluations, revealing that safety is largely unidimensional and that high cross-lingual gaps cluster in physical harm prompts and lower-resource languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15053","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale","primary_cat":"cs.LG","submitted_at":"2026-05-14T16:46:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14218","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fusion-fission forecasts when AI will shift to undesirable behavior","primary_cat":"cs.AI","submitted_at":"2026-05-14T00:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13339","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probing Persona-Dependent Preferences in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T10:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12874","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features","primary_cat":"cs.LG","submitted_at":"2026-05-13T01:41:38+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12726","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Before the Last Token: Diagnosing Final-Token Safety Probe Failures","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false 
positives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12412","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space","primary_cat":"cs.CL","submitted_at":"2026-05-12T17:09:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12400","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:00:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11448","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Deep Minds and Shallow Probes","primary_cat":"cs.LG","submitted_at":"2026-05-12T02:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for 
cross-model transfer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Refusal in language models is mediated by a single direction. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024. URLhttps://arxiv.org/abs/2406.11717. [36] Henry Papadatos and Rachel Freedman. Linear probe penalties reduce LLM sycophancy. InNeurIPS 2024 Workshop on Socially Responsible Language Modelling Research (SoLaR), 2024. URLhttps: //openreview.net/forum?id=6N2yES22rG. 14 [37] Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023. URL https: //arxiv.org/abs/2311.07590. [38] Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, et al. Auditing language models for hidden objectives.arXiv preprint arXiv:2503."},{"citing_arxiv_id":"2605.10310","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Positive Alignment: Artificial Intelligence for Human Flourishing","primary_cat":"cs.AI","submitted_at":"2026-05-11T10:11:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08765","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:50:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Existing LLM unlearning methods fail honesty 
standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07990","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tool Calling is Linearly Readable and Steerable in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-08T16:47:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Gemma 3 270M27 [14,44] 10 [3,26] 0 [0,11] 30 [17,48] Gemma 3 1B 43 [27,61] 77 [59,88] 53 [36,70] 87 [70,95] Gemma 3 4B 96 [89,98] 94 [84,98] 76 [63,86] 100 [89,100] Gemma 3 12B 97 [83,99] 90 [74,97] 80 [63,90] 100 [89,100] Gemma 3 27B 100 [89,100]77 [59,88] 80 [63,90] 100 [89,100] Qwen 3 0.6B 50 [33,67] 77 [59,88] 47 [30,64] 77 [59,88] Qwen 3 1.7B 80 [63,90] 87 [70,95] 60 [42,75] 100 [89,100] Qwen 3 4B 93 [79,98] 80 [63,90] 70 [52,83] 100 [89,100] Qwen 3 8B 100 [89,100]90 [74,97]100 [89,100]100 [89,100] Qwen 3 14B 97 [83,99] 87 [70,95] 63 [46,78] 100 [89,100] Llama 3.1 8B 83 [66,93] 80 [63,90]100 [89,100] 90 [74,97] Qwen 2.5 7B 93 [79,98] 90 [74,97] 73 [56,86] 100 [89,100] Table 22: Wilson 95% CIs for all models and bench-"},{"citing_arxiv_id":"2605.07284","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching 
Diagnostic","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:47:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"These rescue rows are absolute logit gains. Rescue fractions divide the gain by the missing margin, Y(U_IT,L_IT) - Y(U_PT,L_IT), so they are not the same unit as the logit-gain rows. Table E.5: Final-layer feature rescue. Rescue metric, Llama/Mistral/Qwen family-balanced Estimate 95% CI Direct top-200 causal feature rescue +0.494 [+0.451, +0.539] Direct rescue fraction 8.1% [5.5%, 10.3%] 21 Rescue metric, Llama/Mistral/Qwen family-balanced Estimate 95% CI Causal minus matched-random rescue +0.561 [+0.510, +0.613] Causal minus matched-random rescue fraction 10.8% [7.6%, 13.7%] Causal minus same-delta-random rescue +0.471 [+0.427, +0.517] Causal minus same-delta-random rescue fraction 8.3% [5.7%, 10.6%] Per-family direct rescue is Llama +0."},{"citing_arxiv_id":"2605.06652","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian 
scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06196","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:08:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Improving activation steering in language models with mean-centring.ArXiv, abs/2312.03813, 2023. URL https: //api.semanticscholar.org/CorpusID:266053529. [57] Jiaqi Chen, Ming Wang, Tingna Xie, Shi Feng, and Yongkang Liu. A systematic analysis of the impact of persona steering on llm capabilities. 2026. URL https://api.semanticscholar. org/CorpusID:287432603. [58] Xiachong Feng, Liang Zhao, Weihong Zhong, Yi-Chong Huang, Yuxuan Gu, Lingpeng Kong, Xiaocheng Feng, and Bing Qin. Persona: Dynamic and compositional inference-time personality control via activation vector algebra.ArXiv, abs/2602.15669, 2026. URL https://api. semanticscholar.org/CorpusID:285659291. 13 [59] OpenAI. 
Introducing gpt-5.4 mini and nano."},{"citing_arxiv_id":"2605.05715","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"119), and layer profiles diverge (Table 2). The elevated Qwen average is driven by KD geome- try (0.662); OT-specific specificity is comparable across architectures (Qwen: 0.152, Llama: 0.119), suggesting the OT encoding is similarly entan- gled in both models. A steering test on Qwen (n= 1,273 , nine configurations spanning lay- ers 5-18, amplitudes α∈[0.5,3.0] , and mode- specific/uniform/multi-layer variants; Section 5) yields ∆∈[−0.9,+0.8] pp (all p >0.05 ), provid- ing consistent evidence for the cross-architecture steering null across diverse hyperparameter set- tings. 
4.9 Cross-Domain Validation: MMLU-STEM To test domain generality, we replicate the core pipeline on MMLU-STEM (Hendrycks et al., 2021)"},{"citing_arxiv_id":"2605.01609","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations","primary_cat":"cs.LG","submitted_at":"2026-05-02T21:20:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02958","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection","primary_cat":"cs.CR","submitted_at":"2026-05-02T14:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00236","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Attention Is Where You Attack","primary_cat":"cs.CR","submitted_at":"2026-04-30T21:15:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% 
attack success while ablating the same heads barely affects refusals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27861","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning","primary_cat":"cs.CR","submitted_at":"2026-04-30T13:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ing metadata filtering with Llama-3-8B-Guard [ 12] verification, followed by the full three-stage deduplication procedure. Verified benign intents are subsequently decomposed using three distinct models:Mistral-Small-24B-Instruct-2501-abliterated[ 14], Qwen3-32B-abliterated[ 33], andQwen3-30B-A3B-abliterated, all of which are orthogonalized variants produced via refusal direction ablation [1]. Crucially, these models are identical to those employed for malicious decomposition. Sharing decomposition models across both benign and malicious data is a deliberate design choice in- tended to avoid superficial stylistic artifacts introduced by different generation processes. 
Fragmented Malicious Intents. To construct a comprehensive set of adversarial attack scenarios, we collect initial malicious intents"},{"citing_arxiv_id":"2604.27401","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-30T04:13:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"To test generalization, we evaluated on 200 prompts from HarmBench [Mazeika et al., 2024] and 200 diverse benign prompts. Table 3 shows the dose-response (full per-category results and generation examples in Appendix A). Table 3: Dose-response on 200 HarmBench prompts. All gap drops are statistically significant (p < 10^−59). Control neurons show no effect. 
Neurons | Gap drop | 95% CI | p-value; 10 | −1.8% | [−3.3, −0.2] | 2.6×10^−2; 20 | −39.7% | [−43.0, −36.3] | 7.2×10^−59; 50 | −56.6% | [−60.4, −52.8] | 2.8×10^−74; 100 | −64.3% | [−68.5, −60.1] | 6.2×10^−76; 200 | −71.3% | [−75.7, −66.9] | 1.2×10^−80; Control-50 | +0.2% | [+0.1, +0.4] | -. The gap drop differs between evaluation sets: −64% on the 16 identification prompts (Table 1), −56.6% on 200 held-out HarmBench prompts (Table 3), and −61% in the cross-architecture compar-"},{"citing_arxiv_id":"2604.27169","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Semantic Structure of Feature Space in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-29T20:17:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21152","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. 
Explicit User Profiles","primary_cat":"cs.CY","submitted_at":"2026-04-22T23:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19018","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control","primary_cat":"cs.LG","submitted_at":"2026-04-21T03:09:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17663","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data","primary_cat":"cs.LG","submitted_at":"2026-04-19T23:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[16] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. 
Reft: Representation finetuning for language models, 2024. URL https://arxiv.org/abs/2404.03592. [17] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024. URL https://arxiv.org/abs/2406.11717. [18] Sohan Venkatesh and Ashish Mahendran Kurapath. On the non-identifiability of steering vectors in large language models, 2026. URL https://arxiv.org/abs/2602.06801. [19] Soham Gadgil, Chris Lin, and Su-In Lee. Where to steer: Input-dependent layer selection for steering improves llm alignment, 2026. URL https://arxiv.org/abs/2604.03867. [20] Jiaqian Li, Yanshu Li, and Kuan-Hao Huang."},{"citing_arxiv_id":"2604.11663","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Do Large Language Models Generate Harmful Content?","primary_cat":"cs.AI","submitted_at":"2026-04-13T16:11:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lows for improved moderation and guarding of content generation [1]. Mechanistic Interpretability for Harmful Content Generation. Recent work on understanding harmful content generation applies mechanistic interpretability through manipulation of internal subspaces to bypass safeguards [19,7], analyzing prompt features [3], steering vectors [8], and input vectors of refusal and harmfulness prevention [2,10]. Similar works instead employ weak classifiers and logit grafting to modify hidden states [29]. Other studies [4,6] utilize activation patching to isolate individual safety neurons or specific task vulnerabilities. 
Furthermore, other efforts propose a dual framework using representation analysis to characterize how jailbreaks alter the model's per-"},{"citing_arxiv_id":"2604.10990","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Verification Fails: How Compositionally Infeasible Claims Escape Rejection","primary_cat":"cs.CL","submitted_at":"2026-04-13T04:48:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06247","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SALLIE: Safeguarding Against Latent Language & Image Exploits","primary_cat":"cs.CR","submitted_at":"2026-04-06T16:29:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"p^(l)(x) = (1/k_m) ∑_{i ∈ N^(l)_{k_m}(x)} 1[y_i = 1] (3) where N^(l)_{k_m}(x) denotes the indices of the k_m nearest neighbors of h̃^(l)(x) under cosine similarity. Layer Ensemble. To reduce sensitivity to any single layer and obtain a more stable score, we aggregate predictions across the modality-specific layer range L_m ⊆ {1, ..., L}: p̄(x) = (1/|L_m|) ∑_{l ∈ L_m} p^(l)(x) (4) The selection of L_m is discussed in Section 4. 
3.3.1 Decision Rule. The classification decision for input x of modality m is governed by a step function D relative to the threshold τ_m and layer range L_m: D(x, L_m) = { 1 (Prompt Attack) if p̄(x) ≥ τ_m; 0 (Benign) otherwise } (5) where τ_m ∈ [0, 1] controls the trade-off between the false positive rate (FPR) and false"},{"citing_arxiv_id":"2604.04385","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-06T03:20:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}