hub Mixed citations

Steering Llama 2 via Contrastive Activation Addition

Mixed citation behavior. Most common role is background (40%).

51 Pith papers citing it
23 external citations · Crossref
Background 40% of classified citations

citation-role summary

background 3 · baseline 1 · method 1

years

2026: 49 · 2025: 2

representative citing papers

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Mechanistically Eliciting Latent Behaviors in Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

CPE is an unsupervised tensor-decomposition method that finds interpretable LoRAs to surface hidden LLM behaviors, matching supervised methods on some tasks and revealing failure modes like sandbagging and alignment-faking.

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Latent-space Attacks for Refusal Evasion in Language Models · cs.AI · 2026-05-20 · unverdicted · none · ref 12 · 2 links

    Refusal suppression via difference-in-means ablation equals projection onto a linear probe's decision boundary, and a controlled evasion attack optimizing confidence past the boundary achieves SOTA success rates on 15 models.

  • TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction · cs.AI · 2026-05-18 · unverdicted · none · ref 36

    TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.

  • Do Linear Probes Generalize Better in Persona Coordinates? · cs.AI · 2026-05-10 · unverdicted · none · ref 71

    Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.

  • A Geometric Account of Activation Steering through Angle-Norm Decomposition · cs.AI · 2026-06-04 · unverdicted · none · ref 1

    Empirical study across seven language models finds concepts represented primarily in angular structure of activations while norm affects steering stability, recommending separate angular and radial parameterization over single additive coefficients.