Grains: Gradient-based attribution for inference-time steering of llms and vlms.CoRR, abs/2507.18043, 2025a

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal · 2025 · arXiv 2507.18043

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.

Continuous Interpretive Steering for Scalar Diversity

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

Continuous Interpretive Steering and the GraSD dataset reveal that LLMs encode graded sensitivity to scalar diversity in their internal representations, recoverable via controlled activation interventions.

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer

Showing 4 of 4 citing papers.

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection cs.LG · 2026-05-27 · unverdicted · none · ref 33
Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
Continuous Interpretive Steering for Scalar Diversity cs.CL · 2026-04-08 · unverdicted · none · ref 1
Continuous Interpretive Steering and the GraSD dataset reveal that LLMs encode graded sensitivity to scalar diversity in their internal representations, recoverable via controlled activation interventions.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment cs.LG · 2026-04-07 · unverdicted · none · ref 44
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 220
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Grains: Gradient-based attribution for inference-time steering of llms and vlms.CoRR, abs/2507.18043, 2025a

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer