Introduces the BonaFide benchmark of 3,066 ground-truth-labeled CoTs, showing that most faithfulness metrics perform near chance, exhibit biases, and scale poorly to longer chains.
Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?
17 Pith papers cite this work.
representative citing papers
Chain-of-Thought reasoning in LLMs is often unfaithful: how much models rely on it varies by task, and reliance decreases as models grow larger.
Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.
Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades faithfulness and calibration.
Sgatlin replaces transformer FF layers with sparse single linear neurons, improving perplexity across compute budgets and enabling direct interpretation of semantically clustered circuits for factual recall.
A new framework quantifies faithful confidence expression in large reasoning models by comparing linguistic decisiveness to token probabilities, hidden states, and response consistency, revealing it as a persistent challenge.
Summary reasoning traces from LLMs maintain task performance and increase trust and appeal relative to answer-only or full-trace conditions, but none of the formats improve users' metacognitive calibration on reasoning tasks.
Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than explanatory power alone.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
Joint NMF and binomial regression learns response-relevant text signals with competitive performance on simulations and review data.
ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.
NEURON integrates SNOMED CT, ML, and RAG LLM to raise AUC from 0.74-0.77 to 0.84-0.88 and human-aligned explainability scores from 0.50 to 0.85 on MIMIC-IV acute heart failure data.
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.