hub Mixed citations

Steering Llama 2 via Contrastive Activation Addition

Mixed citation behavior. Most common role is background (40%).

49 Pith papers citing it
23 external citations · Crossref
Background 40% of classified citations

citation-role summary

background 3 · baseline 1 · method 1

years

2026: 47 · 2025: 2

representative citing papers

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Mechanistically Eliciting Latent Behaviors in Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

CPE is an unsupervised tensor-decomposition method that finds interpretable LoRAs to surface hidden LLM behaviors, matching supervised methods on some tasks and revealing failure modes like sandbagging and alignment-faking.

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.
