A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
hub
arXiv preprint arXiv:2305.15324 , year=
21 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
LLM embeddings enable strong retrodiction of masked GSS opinions via cross-validation and external validation but only modest performance on entirely unasked opinions.
A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
MalGEN generates 977 executable malware samples across 1920 settings, with 45.71% evading existing detection engines and exposing gaps in current defenses.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.
A framework with seven dimensions for AI incident reporting systems is developed from literature and case studies in safety-critical industries to guide institutional design choices.
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
citing papers explorer
-
Measuring Safety Alignment Effects in Autonomous Security Agents
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
Overtrained, Not Misaligned
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
-
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
-
Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
LLM-Guided Prompt Evolution for Password Guessing
LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
-
Benchmarking Misuse Mitigation Against Covert Adversaries
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction
LLM embeddings enable strong retrodiction of masked GSS opinions via cross-validation and external validation but only modest performance on entirely unasked opinions.
-
Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.
-
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors
MalGEN generates 977 executable malware samples across 1920 settings, with 45.71% evading existing detection engines and exposing gaps in current defenses.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
-
Internal Deployment in the AI Act
Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.
-
Designing Incident Reporting Systems for Harms from General-Purpose AI
A framework with seven dimensions for AI incident reporting systems is developed from literature and case studies in safety-critical industries to guide institutional design choices.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.