Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685, 2020

Alon Jacovi, Yoav Goldberg · 2020 · arXiv 2004.03685

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

EEG foundation models show no single winner across failure modes, attend to correct brain regions but decode corrupted signals, and retain task information in early layers while late layers adapt during fine-tuning.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

Evaluating Multi-turn Human-AI Interaction

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.

Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

cs.SE · 2025-03-21 · unverdicted · novelty 6.0

CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

cs.CL · 2026-05-15 · unverdicted · novelty 4.0

SGR enhances LLM reasoning accuracy by generating external subgraphs from knowledge bases and guiding progressive inference over them, yielding consistent gains over baselines on benchmarks.

citing papers explorer

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
cs.LG · 2026-05-21 · unverdicted · none · ref 9

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models
cs.LG · 2026-05-17 · unverdicted · none · ref 46

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
cs.AI · 2025-03-14 · conditional · none · ref 27

Evaluating Multi-turn Human-AI Interaction
cs.HC · 2026-05-18 · unverdicted · none · ref 68

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05-12 · unverdicted · none · ref 174

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
cs.CL · 2026-04-17 · unverdicted · none · ref 8

Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation
cs.SE · 2025-03-21 · unverdicted · none · ref 25

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation
cs.CL · 2026-05-15 · unverdicted · none · ref 3
