Successor heads: Recurring, interpretable attention heads in the wild

Rhys Gould, Euan Ong, George Ogden, Arthur Conmy · 2023 · arXiv 2312.09230

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

cs.CR · 2025-06-03 · unverdicted · novelty 6.0

Introduces a six-dimension definition of trustworthiness and an attention-based A-Trust score, together with a trust management system (TMS), to improve the robustness of LLM-MAS against malicious or unreliable messages.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

citing papers explorer

Showing 2 of 2 citing papers.

To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems

cs.CR · 2025-06-03 · unverdicted · none · ref 28

Introduces a six-dimension definition of trustworthiness and an attention-based A-Trust score, together with a trust management system (TMS), to improve the robustness of LLM-MAS against malicious or unreliable messages.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · none · ref 26

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.
