Toy Models of Superposition

Catherine Olsson, Nelson Elhage, Nicholas Schiefer, Shauna Kravec, Tom Henighan, Tristan Hume · 2022 · cs.LG · arXiv 2209.10652

Canonical reference. 85% of citing Pith papers cite this work as background.

126 Pith papers citing it

Background: 85% of classified citations

abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

citation-role summary

background: 19 · method: 1

citation-polarity summary

background: 17 · support: 1 · unclear: 1 · use method: 1

representative citing papers

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · novelty 8.0

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 4 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

The Linear Representation Hypothesis and the Geometry of Large Language Models

cs.CL · 2023-11-07 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Introduces PairSAE, a sparse autoencoder for pair representations in structural biology foundation models that produces features aligned with UniProt annotations and affinity predictions.

Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Sparsity regularizers applied before Top-k selection in SAEs improve monosemanticity and make reconstruction robust to inference-time k across vision models and datasets.

Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)

cs.HC · 2026-06-22 · unverdicted · novelty 7.0

An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Learning Coherent Representations: A Topological Approach to Interpretability

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Subliminal Learning is a LoRA Artifact

cs.AI · 2026-05-30 · conditional · novelty 7.0

Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

cs.LG · 2026-05-24 · unverdicted · novelty 7.0

A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.

Learning Through Noise: Why Subliminal Learning Works and When It Fails

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

cs.RO · 2026-05-17 · conditional · novelty 7.0

Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity and degrading decodability with added tasks.

citing papers explorer

Showing 50 of 112 citing papers after filters.

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · none · ref 57 · internal anchor

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · none · ref 87 · internal anchor

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · none · ref 15 · 4 links · internal anchor

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · none · ref 19 · internal anchor

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · none · ref 3 · internal anchor

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding

cs.LG · 2026-06-25 · unverdicted · none · ref 3 · internal anchor

Introduces PairSAE, a sparse autoencoder for pair representations in structural biology foundation models that produces features aligned with UniProt annotations and affinity predictions.

Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

cs.LG · 2026-06-25 · unverdicted · none · ref 6 · internal anchor

Sparsity regularizers applied before Top-k selection in SAEs improve monosemanticity and make reconstruction robust to inference-time k across vision models and datasets.

Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)

cs.HC · 2026-06-22 · unverdicted · none · ref 36 · internal anchor

An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

cs.CL · 2026-06-10 · unverdicted · none · ref 38 · internal anchor

Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

cs.LG · 2026-06-08 · unverdicted · none · ref 7 · internal anchor

VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · none · ref 18 · internal anchor

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Learning Coherent Representations: A Topological Approach to Interpretability

cs.LG · 2026-06-01 · unverdicted · none · ref 28 · internal anchor

Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · none · ref 13 · internal anchor

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

cs.LG · 2026-05-24 · unverdicted · none · ref 3 · internal anchor

A Riemannian geodesic framework for label-free manifold steering in language models via a schema-supervised encoder approximating output Hellinger distance on activations.

Learning Through Noise: Why Subliminal Learning Works and When It Fails

cs.LG · 2026-05-22 · unverdicted · none · ref 23 · internal anchor

Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

stat.ML · 2026-05-13 · unverdicted · none · ref 25 · internal anchor

In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

cs.LG · 2026-05-11 · unverdicted · none · ref 3 · internal anchor

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

cs.LG · 2026-05-10 · unverdicted · none · ref 9 · internal anchor

A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity and degrading decodability with added tasks.

SMIXAE: Towards Unsupervised Manifold Discovery in Language Models

cs.LG · 2026-05-09 · unverdicted · none · ref 1 · internal anchor

SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · none · ref 22 · 2 links · internal anchor

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · none · ref 19 · internal anchor

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

stat.ML · 2026-05-06 · unverdicted · none · ref 16 · internal anchor

Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · none · ref 288 · internal anchor

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · none · ref 5 · internal anchor

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers

cs.CV · 2026-04-30 · unverdicted · none · ref 4 · internal anchor

CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume channel selection algorithm.

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

cs.CV · 2026-04-29 · unverdicted · none · ref 11 · internal anchor

Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

cs.LG · 2026-04-21 · unverdicted · none · ref 37 · internal anchor

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · none · ref 48 · internal anchor

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · none · ref 14 · internal anchor

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

cs.CV · 2026-04-14 · unverdicted · none · ref 12 · internal anchor

Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

cs.CV · 2026-04-07 · unverdicted · none · ref 14 · internal anchor

The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.

Isotropic Activation Functions Enable Deindividuated Neurons and Adaptive Topologies

cs.NE · 2026-02-26 · unverdicted · none · ref 24 · internal anchor

Isotropic activation functions derived from reparameterisation symmetries and SVD diagonalisation enable function-preserving neuron removal and addition in dense networks, supporting up to 50% sparsification and real-time topology adaptation.

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

cs.LG · 2025-09-03 · unverdicted · none · ref 24 · internal anchor

SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · none · ref 13 · internal anchor

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · none · ref 20 · internal anchor

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

cs.AI · 2026-07-01 · unverdicted · none · ref 29 · internal anchor

HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.

Radical AI Interpretability

cs.AI · 2026-06-25 · unverdicted · none · ref 32 · internal anchor

A framework is proposed for solving for an AI system's beliefs and desires from its computational facts, with criteria for success tied to interpretability tests and emphasis on holistic attribution.

Steering Vision-Language Models with Joint Sparse Autoencoders

cs.CV · 2026-06-24 · unverdicted · none · ref 24 · internal anchor

JSAE jointly factorizes pooled vision and language activations in VLMs into aligned interpretable features, revealing layer-dependent asymmetry in additive steering versus suppression on three models.

Concept Removal for Frontier Image Generative Models

cs.CV · 2026-06-24 · unverdicted · none · ref 143 · internal anchor

A transcoder-based in-place replacement of the bottleneck layer enables selective concept removal in modern diffusion and autoregressive image models without degrading output quality.

Evidence for feature-specific error correction in LLMs

cs.LG · 2026-06-23 · unverdicted · none · ref 5 · internal anchor

Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.

Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

cs.CV · 2026-06-19 · unverdicted · none · ref 6 · internal anchor

A new SAE-based framework extracts visual, textual, and multimodal concepts from VLMs and reports up to 45% better visual concept quality on a VQA dataset while identifying multimodal concepts.

Critical Percolation as a Synthetic Data Model for Interpretability

cs.LG · 2026-06-18 · unverdicted · none · ref 19 · internal anchor

Critical percolation clusters embedded in high dimensions, combined with taxonomic latent variables, form an analytically tractable synthetic data model whose ground-truth hierarchy can be linearly decoded from network activations.

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

cs.LG · 2026-06-18 · unverdicted · none · ref 9 · internal anchor

Compositionality emerges in neural networks only in a narrow depth-connectivity regime, with gradient descent converging to fractured solutions outside it.

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

cs.LG · 2026-06-16 · unverdicted · none · ref 6 · internal anchor

Derives an upper bound on frozen LM expected risk from proxy risk, SAE reconstruction gap, concept-pool mismatch and sparse complexity, with non-vacuous bounds observed on GPT-2, Gemma-2B and Llama-3-8B.

When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

cs.LG · 2026-06-15 · unverdicted · none · ref 8 · internal anchor

Proposes using sparse autoencoders to extract class-conditioned concept vectors, then measuring logit stability under targeted perturbations as an interpretable OOD signal for deep networks in medical imaging.

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

cs.CR · 2026-06-08 · unverdicted · none · ref 46 · internal anchor

Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.

Interactions Between Crosscoder Features: A Compact Proofs Perspective

cs.LG · 2026-06-08 · unverdicted · none · ref 13 · internal anchor

Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

cs.LG · 2026-06-06 · unverdicted · none · ref 3 · internal anchor

Pre-intervention feature statistics predict SAE steering modularity (stability and collateral spread) better than baselines across multiple models and dictionaries, with model-dependent success in held-out selection.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

cs.AI · 2026-06-06 · unverdicted · none · ref 14 · internal anchor

Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

cs.CV · 2026-06-05 · unverdicted · none · ref 107 · internal anchor

TEVI applies sparse autoencoders and caption-conditioned masking to edit image embeddings, yielding better retrieval on MS COCO, Flickr, IIW, DOCCI, and RoCOCO benchmarks with larger gains on richer captions.
