The khipu problem frames a governance failure in distributed AI in which interpretive continuity is lost even when traces remain, so infrastructure must preserve reading practices rather than only retain data.
Towards A Rigorous Science of Interpretable Machine Learning
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.
representative citing papers
An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
A training-free method uses Fourier-parameterized star-convex contours, optimized via gradients, to generate compact, faithful visual attributions for image classifiers on benchmarks such as ImageNet.
A method automatically constructs a causal model from behavior tree structure and domain knowledge to generate real-time causal counterfactual explanations for robot decisions.
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.
Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.
Matryoshka Sparse Autoencoders applied to matrix-factorization embeddings recover hierarchical, metadata-aligned features that permit targeted intervention on gender-associated neurons.
Introduces a constraint-satisfaction algorithm and complexity results for recovering linear utilities and latent group bonuses to explain observed rankings under hidden sensitive features.
I-SAFE is a post-hoc auditing framework that applies quantile-based and Wasserstein coherence metrics to evaluate distributional response of DTI prediction models under structural perturbations from external priors like KLIFS annotations.
AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.
p-ResNet-50 adds a prototype layer with anchor- and medoid-based regularizations to ResNet-50, achieving ROC-AUC 0.994 and accuracy 0.957 on ~12k XCT patches while supplying case-based explanations aligned to expert categories.
The authors introduce a taxonomy with target, functional role, and mode of justification axes plus a framework that decomposes abstract XAI desiderata into concrete benchmarkable tasks via identified dependency structures.
CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.
An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.
The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.
citing papers explorer
- ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
- In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
- Mechanistic Interpretability with Sparse Autoencoder Neural Operators
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
- I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models
I-SAFE is a post-hoc auditing framework that applies quantile-based and Wasserstein coherence metrics to evaluate distributional response of DTI prediction models under structural perturbations from external priors like KLIFS annotations.
- Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models
An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding
ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.
- Towards interpretable AI with quantum annealing feature selection
Quantum annealing solves a combinatorial feature-map selection problem for CNNs, yielding improved class disentanglement over GradCAM and GradCAM++ in the reported evaluation.
- Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.
- Faster Verified Explanations for Neural Networks
FaVeX accelerates verified explanations for neural networks via dynamic batch-sequential processing and query reuse while introducing verifier-optimal robust explanations that incorporate verifier incompleteness.
- Interpretable and Steerable Sequence Learning via Prototypes
ProSeNet learns a sparse set of prototypes for case-based explanations in deep sequence models, matches state-of-the-art accuracy on several tasks, and supports manual prototype refinement by non-experts.
- The Price of Interpretability
Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.
- Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models
Reliability and faithfulness of post-hoc explanations do not suffice to support claims about how a scientific phenomenon is structured.
- ChainzRule: Sample-Efficient, Robust Deep Learning Across Tabular, NLP, and Vision Tasks
ChainzRule uses learnable polynomial layers with differential regularization on the Jacobian to promote stable low-frequency representations, claiming improved sample efficiency and robustness on multiple benchmarks.
- NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training
NeuroViz offers interactive real-time visualization of neural network forward and backward passes, achieving top usability scores in a study with 31 participants compared to existing tools.
- Detection of Real-world Driving-induced Affective State Using Physiological Signals and Multi-view Multi-task Machine Learning
A multi-view multi-task ML method detects real-world driving-induced affective states using physiological signals by modeling inter-drive variability, with results showing performance gains on three datasets.
- Optimal Explanations of Linear Models
An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.
- A Human-Grounded Evaluation of SHAP for Alert Processing
Human-grounded evaluation finds no significant performance improvement from adding SHAP explanations to model confidence scores in alert processing.
- CW-B: Class Weighted Boosting Framework for Imbalance Resilient Multi Class Cardiac Phenotyping
CW-B is a class-weighted XGBoost method with missingness indicators and classwise auditing that reports best-in-baseline Accuracy, Macro-F1, Balanced Accuracy, and Prioritized F1 on five-class cardiac phenotyping under imbalance.
- Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
Multimodal anomaly detection must be reframed as cross-modal contextual inference that separates context from observations to define abnormality conditionally.
- Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms
The paper delivers a mechanism-centric taxonomy and unified perspective on explainable human activity recognition methods across sensing modalities.
- Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
- Explainability Methods for Hardware Trojan Detection: A Systematic Comparison
Compares domain-aware, case-based, and feature attribution explainability methods for gate-level hardware Trojan detection on the Trust-Hub benchmark dataset.
- Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health
AIMEN trains an ensemble of neural networks on CTGAN-augmented data to predict adverse labor outcomes at 0.784 F1 and produces sparse counterfactual explanations identifying changes in two to three attributes.