ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large Language Models (LLMs) have achieved tremendous success in various tasks, yet concerns about their safety and security have emerged. In particular, they risk generating harmful content and are vulnerable to jailbreaking attacks, which leaves security issues in their deployment unaddressed. Within software engineering for artificial intelligence (SE4AI), model-based analysis has demonstrated notable potential for analyzing and monitoring machine learning models, particularly stateful deep neural networks. However, it suffers from scalability issues when extended to LLMs due to their vast feature spaces. In this paper, we aim to address the scalability issue of model-based analysis techniques for safeguarding LLM-scale models. Motivated by the recent discovery of low-dimensional safety-critical representations that emerge in LLMs, we propose ReGA, a model-based analysis framework with Representation-Guided Abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are key directions in hidden states that indicate safety-related concepts, ReGA effectively narrows the scalability gap when developing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA distinguishes safe from harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety.
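To make the two technical ideas in the abstract concrete (a safety-related direction in hidden states, and AUROC as the separation metric), here is a minimal sketch. It is not ReGA's method: it assumes a simple difference-of-means direction, synthetic Gaussian vectors standing in for hidden states, and no abstract model at all. Every name in it (`safety_direction`, `project`, `auroc`, `make_states`) is hypothetical and chosen only for illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_direction(harmful, safe):
    """Unit vector pointing from the safe mean to the harmful mean.

    Assumption: a difference-of-means direction, used here only as a
    stand-in for a 'safety-critical representation'.
    """
    mu_h = mean_vector(harmful)
    mu_s = mean_vector(safe)
    diff = [h - s for h, s in zip(mu_h, mu_s)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(vec, direction):
    """Scalar projection of a vector onto the direction (dot product)."""
    return sum(v * d for v, d in zip(vec, direction))


def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def make_states(rng, count, dim, shift):
    """Synthetic 'hidden states': unit Gaussian noise, first axis shifted."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(count)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 16

# Synthetic stand-ins for hidden states of harmful and safe inputs.
train_harmful = make_states(rng, 200, DIM, shift=3.0)
train_safe = make_states(rng, 200, DIM, shift=0.0)
test_harmful = make_states(rng, 200, DIM, shift=3.0)
test_safe = make_states(rng, 200, DIM, shift=0.0)

# Fit the direction on the training split, score the held-out split.
direction = safety_direction(train_harmful, train_safe)
harmful_scores = [project(v, direction) for v in test_harmful]
safe_scores = [project(v, direction) for v in test_safe]

score = auroc(harmful_scores, safe_scores)
print(f"AUROC on synthetic data: {score:.3f}")
```

The point of the sketch is the dimensionality reduction: each high-dimensional vector collapses to one scalar along the chosen direction, and AUROC measures how well that scalar separates the two classes. The number it prints describes the synthetic data only and says nothing about the results reported above.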

fields

cs.LG 1 cs.SE 1

years

2026 1 2025 1

verdicts

UNVERDICTED 2

representative citing papers

Secure LLM Fine-Tuning via Safety-Aware Probing

cs.LG · 2025-05-22 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

citing papers explorer

Showing 2 of 2 citing papers.

  • RACC: Representation-Aware Coverage Criteria for LLM Safety Testing · cs.SE · 2026-02-02 · unverdicted · none · ref 55 · internal anchor

    RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

  • Secure LLM Fine-Tuning via Safety-Aware Probing · cs.LG · 2025-05-22 · unverdicted · none · ref 29 · internal anchor

    SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.