RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Does representation matter? exploring intermediate layers in large language models
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7representative citing papers
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
Perceptual geometry for color, pitch, emotion and taste emerges transiently in intermediate layers of transformer LLMs despite purely textual training.
citing papers explorer
-
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
LLM Self-Recognition: Steering and Retrieving Activation Signatures
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
-
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
-
Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations
Perceptual geometry for color, pitch, emotion and taste emerges transiently in intermediate layers of transformer LLMs despite purely textual training.