Evaluating superhuman models with consistency checks
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2
representative citing papers
ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.
citing papers explorer
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- Argumentative Large Language Models for Explainable and Contestable Claim Verification
ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.