Title resolution pending

Representation Engineering: A Top-Down Approach to AI Transparency , author= · 2025

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

cs.RO · 2026-05-16 · conditional · novelty 6.0

COAST applies contrastive conceptors to steer VLA hidden states into task-specific success subspaces, yielding over 20% simulation and 40% real-robot success rate gains across three distinct policies.

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

A geometric relation of the error introduced by sampling a language model's output distribution to its internal state

cs.LG · 2026-05-06 · unverdicted · novelty 5.0 · 2 refs

A geometric 1-form on token embeddings has curvature that couples to semantic world models in language models, as evidenced by clustering on chess board regions and piece importance.

citing papers explorer

Showing 5 of 5 citing papers.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition cs.LG · 2026-05-14 · unverdicted · none · ref 33
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States cs.RO · 2026-05-16 · conditional · none · ref 13
COAST applies contrastive conceptors to steer VLA hidden states into task-specific success subspaces, yielding over 20% simulation and 40% real-robot success rate gains across three distinct policies.
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 9
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
LLM Safety From Within: Detecting Harmful Content with Internal Representations cs.AI · 2026-04-20 · unverdicted · none · ref 72
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
A geometric relation of the error introduced by sampling a language model's output distribution to its internal state cs.LG · 2026-05-06 · unverdicted · none · ref 27 · 2 links
A geometric 1-form on token embeddings has curvature that couples to semantic world models in language models, as evidenced by clustering on chess board regions and piece importance.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer