Advancing LLM safe alignment with safety representation ranking

Du, T · 2025 · arXiv 2505.15710

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

The work introduces a distributional view of visual mechanistic interpretability that casts the task as KL-minimal optimization and realizes it through a soft-constraint principle implemented with energy-guided diffusion posterior sampling on models such as DINOv3.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

cs.CV · 2026-04-10 · unverdicted · novelty 6.0

Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.

ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

cs.CR · 2025-06-02 · unverdicted · novelty 5.0

ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

citing papers explorer

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing · cs.SE · 2026-02-02 · unverdicted · none · ref 20
A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle · cs.CV · 2026-05-17 · unverdicted · none · ref 2
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion · cs.AI · 2026-05-12 · unverdicted · none · ref 35 · 2 links
Enabling Performant and Flexible Model-Internal Observability for LLM Inference · cs.LG · 2026-05-11 · unverdicted · none · ref 8
Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance · cs.CV · 2026-04-10 · unverdicted · none · ref 7
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction · cs.CR · 2025-06-02 · unverdicted · none · ref 65
