Representation Engineering: A Top-Down Approach to AI Transparency

Mixed citation behavior. Most common role is background (62%).

274 Pith papers citing it
Background 62% of classified citations
abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

citation-role summary

background 17 · baseline 2 · method 2

representative citing papers

Rift: A Conflict Signature for Deception in Language Models

cs.LG · 2026-06-15 · conditional · novelty 8.0

Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.

FloatDoor: Platform-Triggered Backdoors in LLMs

cs.CR · 2026-06-17 · unverdicted · novelty 7.0

FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

cs.LG · 2026-06-13 · unverdicted · novelty 7.0

Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

    cs.AI · 2025-03-14 · conditional · none · ref 70 · internal anchor

    Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

  • Reasoning Can Be Restored by Correcting a Few Decision Tokens

    cs.AI · 2026-05-16 · conditional · none · ref 30 · internal anchor

    Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.

  • Belief or Circuitry? Causal Evidence for In-Context Graph Learning

    cs.AI · 2026-05-08 · conditional · none · ref 14 · internal anchor

    Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.

  • Characterizing Model-Native Skills

    cs.AI · 2026-04-19 · conditional · none · ref 46 · internal anchor

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.