Transactions on Machine Learning Research
10 Pith papers cite this work.
citing papers explorer
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories that match the natural manifold of outputs, while linear steering produces unnatural, off-manifold behaviors.
- ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
  ActiveDPO is a theoretically grounded active data selection method for sample-efficient LLM alignment that parameterizes the reward model directly with the LLM being aligned.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
- Interpretability Can Be Actionable
  Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than by explanatory power alone.
- Belief or Circuitry? Causal Evidence for In-Context Graph Learning
  Causal evidence from representation analysis and interventions shows that LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
- Scheming Ability in LLM-to-LLM Strategic Interactions
  Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
  WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving state-of-the-art open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
- Scaling and renormalization in high-dimensional regression
  Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.
- Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
  ShaPO improves LLM safety robustness over standard preference optimization by enforcing worst-case objectives via selective geometry control at token and reward levels.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.