pith. sign in

hub

Automatically interpreting millions of features in large language models

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

hub tools

citation-role summary

background 1

citation-polarity summary

years

2026 10 2025 2

roles

background 1

polarities

background 1

representative citing papers

Are Sparse Autoencoder Benchmarks Reliable?

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.

Why Retrieval-Augmented Generation Fails: A Graph Perspective

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Features have life history. And we should care

q-bio.NC · 2026-05-07 · unverdicted · novelty 5.0

Language model features form an early stable carrier scaffold of about 50 sparse features that is load-bearing, predictable from onset firing, and recruits most later features.

citing papers explorer

Showing 12 of 12 citing papers.