arXiv preprint arXiv:2403.00745 , year=

Kram · 2024 · arXiv 2403.00745

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 2 method 2

citation-polarity summary

background 2 use method 2

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

citing papers explorer

Showing 8 of 8 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 12 · 2 links
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
How LLMs Are Persuaded: A Few Attention Heads, Rerouted cs.AI · 2026-05-10 · unverdicted · none · ref 32
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
Cell-Based Representation of Relational Binding in Language Models cs.CL · 2026-04-21 · unverdicted · none · ref 44
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs cs.AI · 2026-04-09 · accept · none · ref 52
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation cs.LG · 2026-05-12 · unverdicted · none · ref 47
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure cs.LG · 2026-05-09 · unverdicted · none · ref 13
Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 24
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models cs.LG · 2024-03-28 · unverdicted · none · ref 39
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

arXiv preprint arXiv:2403.00745 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer