Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations

Andrew Slavin Ross; Finale Doshi-Velez; Michael C. Hughes

arxiv: 1703.03717 · v2 · pith:KXZVITQ2new · submitted 2017-03-10 · 💻 cs.LG · cs.AI · stat.ML

classification 💻 cs.LG · cs.AI · stat.ML

keywords models · explanations · right · training · when · conditions · datasets · decision

Neural networks are among the most accurate supervised learning methods in use today, but their opacity makes them difficult to trust in critical applications, especially when conditions in training differ from those in test. Recent work on explanations for black-box models has produced tools (e.g. LIME) to show the implicit rules behind predictions, which can help us identify when models are right for the wrong reasons. However, these methods do not scale to explaining entire datasets and cannot correct the problems they reveal. We introduce a method for efficiently explaining and regularizing differentiable models by examining and selectively penalizing their input gradients, which provide a normal to the decision boundary. We apply these penalties both based on expert annotation and in an unsupervised fashion that encourages diverse models with qualitatively different decision boundaries for the same classification problem. On multiple datasets, we show our approach generates faithful explanations and models that generalize much better when conditions differ between training and test.
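The input-gradient penalty described in the abstract can be illustrated with a minimal sketch. It assumes a plain logistic-regression model as a stand-in for a neural network, an expert annotation given as a binary mask over input features (1 = feature should be irrelevant), and a penalty equal to the squared, masked gradient of the summed log-probabilities with respect to the input. The names `rrr_loss`, `mask`, and `lam` are illustrative and do not come from the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x):
    """Gradient of log p(y=0|x) + log p(y=1|x) with respect to x.

    For logistic regression this has the closed form (1 - 2p) * w,
    so no automatic differentiation is needed in this sketch.
    """
    p = predict(w, b, x)
    return [(1.0 - 2.0 * p) * wi for wi in w]


def rrr_loss(w, b, x, y, mask, lam=10.0):
    """Cross-entropy plus a penalty on input gradients in masked features.

    mask[d] == 1 marks feature d as one the model should NOT rely on
    (assumed to come from an expert annotation).
    Returns (total, cross_entropy, penalty).
    """
    p = predict(w, b, x)
    cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = input_gradient(w, b, x)
    penalty = sum((m * g) ** 2 for m, g in zip(mask, grad))
    return cross_entropy + lam * penalty, cross_entropy, penalty


# Two models that make the same prediction on x for different reasons.
x, y = [1.0, 1.0], 1
mask = [0, 1]  # annotation: feature 1 should be irrelevant

right_reasons = rrr_loss([2.0, 0.0], 0.0, x, y, mask)  # relies on feature 0
wrong_reasons = rrr_loss([0.0, 2.0], 0.0, x, y, mask)  # relies on feature 1
```

Both models have the same cross-entropy on this example, but only the one relying on the masked feature pays the penalty, which is the sense in which the regularizer separates models that are right for the right reasons from those that are not. With a real network, the closed-form gradient would be replaced by an automatic-differentiation call.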

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study
cs.CV 2026-02 unverdicted novelty 6.0

Benchmark shows that combining data rebalancing with feature disentanglement mitigates shortcut learning more effectively than rebalancing alone in medical imaging models.

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
cs.CL 2023-05 conditional novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

Shortcut Mitigation via Spurious-Positive Samples
cs.LG 2026-05 unverdicted novelty 5.0

A method uses spurious-positive samples to identify and regularize neurons that rely on spurious features, improving model robustness without extra annotations or balanced data.

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
cs.LG 2025-06 conditional novelty 5.0

Attribution-guided pruning with contrastive relevance identifies behavior-specific circuits in small LLMs and removes as little as 0.03-0.3% of components to reduce toxicity or repetition while preserving general performance.