Summing up the facts: Additive mechanisms behind factual recall in LLMs, 2024

Chughtai, B · 2024 · arXiv 2402.07321

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

cs.CL · 2026-04-02 · unverdicted · novelty 6.0

MoE expert neurons show lower polysemanticity than dense FFN neurons, widening with sparser routing, and experts specialize in fine-grained tasks like specific linguistic operations, supporting expert-level interpretability.

Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

cs.CL · 2025-09-29 · unverdicted · novelty 6.0

A new framework using Task Subspace Logit Attribution localizes attention heads specialized for task recognition and task learning in in-context learning, showing they align and rotate hidden states within a task subspace.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.

Tracing Relational Knowledge Recall in Large Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · none · ref 2

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

cs.LG · 2026-04-09 · unverdicted · none · ref 8

Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
