Do Androids Know They’re Only Dreaming of Electric Sheep?
8 Pith papers cite this work, alongside 3 external citations.
Citing papers by year: 2026 (8)
citing papers explorer
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- LLM Self-Recognition: Steering and Retrieving Activation Signatures
  Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
- Boosting Self-Consistency with Ranking
  RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
- TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
  TriLens detects hallucinations via per-layer entropy trajectories of logit-lens readouts from three internal modules across LLMs and QA benchmarks.
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning
  RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
- From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models
  A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.
- The Role of Ambiguity in Error Prediction via Uncertainty Quantification
  Disentangling input ambiguity from uncertainty quantification improves error prediction for LLMs on QA tasks, yielding over 10 PRR point gains across models and datasets.
- Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
  KL divergence of attention heads from uniform distribution predicts LLM answer correctness across datasets and model families.