Interpretable Question Answering on Knowledge Bases and Text

Alona Sydorova; Benjamin Roth; Nina Poerner

arXiv: 1906.10924 · v1 · pith:IPNS7K3Xnew · submitted 2019-06-26 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-25 16:00 UTC · model grok-4.3

keywords question answering · interpretability · explanation methods · input perturbation · LIME · attention mechanism · knowledge bases · text documents

The pith

Input perturbation provides better explanations for QA models on knowledge bases and text than LIME or attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts two post-hoc explanation techniques, LIME and input perturbation, to question answering models that draw on both knowledge bases and text documents, then pits them against the model's own attention weights. It introduces an automatic evaluation that scores explanations by how well they let a downstream judge pick the stronger of two models. A parallel human study asks annotators to do the same selection task after seeing explanations from each method. Both the automatic scores and the human choices rank input perturbation highest, followed by attention then LIME, and the identical ordering is taken as evidence that the automatic test tracks human judgment. If this ranking holds, developers gain a cheaper way to compare explanation quality without running fresh human studies every time.

Core claim

Input perturbation yields higher-quality explanations than either LIME or the model's attention mechanism when applied to question answering over combined knowledge bases and text; this superiority appears in both an automatic evaluation that measures how well explanations help identify the stronger model and in direct human judgments, and the agreement between the two evaluation routes supports treating the automatic measure as a valid proxy.

What carries the argument

The argument rests on input perturbation (IP), a post-hoc method that measures how the model's output changes under controlled changes to its input. The paper applies IP to hybrid KB-text QA and ranks it against LIME and attention by both automatic and human selection accuracy.
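
To make the idea of input perturbation concrete, here is a minimal leave-one-out sketch. It is not the paper's formulation: the function `input_perturbation_scores`, the stand-in `toy_model`, and the three example facts are all hypothetical, and "perturbation" is assumed here to mean deleting one KB fact or text sentence at a time.

```python
from typing import Callable, List, Sequence


def input_perturbation_scores(
    model: Callable[[Sequence[str]], float],
    facts: Sequence[str],
) -> List[float]:
    """Score each fact by how much the answer score drops when it is removed."""
    baseline = model(facts)
    scores = []
    for i in range(len(facts)):
        # Leave-one-out perturbation: delete fact i, keep everything else.
        perturbed = list(facts[:i]) + list(facts[i + 1:])
        scores.append(baseline - model(perturbed))
    return scores


def toy_model(facts: Sequence[str]) -> float:
    """Hypothetical stand-in for a QA model: confidence in the answer 'Paris'."""
    score = 0.125
    if "France capital Paris" in facts:
        score += 0.5
    if "Paris is the capital of France." in facts:
        score += 0.25
    return score


facts = [
    "France capital Paris",             # KB triple (illustrative)
    "Paris is the capital of France.",  # text sentence (illustrative)
    "Berlin is in Germany.",            # distractor
]
scores = input_perturbation_scores(toy_model, facts)
# Facts sorted from most to least influential on the answer score.
ranked = sorted(zip(facts, scores), key=lambda pair: pair[1], reverse=True)
```

In this toy setup the KB triple receives the largest score because removing it causes the largest drop in confidence, and the distractor receives zero. A real QA model would need one forward pass per perturbed input.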

If this is right

Input perturbation can replace or supplement attention when users need to understand why a KB-text QA system gave a particular answer.
The automatic evaluation lets developers rank new explanation methods without running a full human study for each comparison.
Attention weights alone are not the most reliable signal for interpretability in this class of models.
Post-hoc perturbation methods remain effective even when the underlying model fuses structured KB facts with unstructured text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same automatic paradigm could be applied to other NLP tasks where explanation quality needs cheap, repeatable measurement.
If input perturbation scales to larger models, production QA services might expose perturbation-derived highlights to end users by default.
The human study design could be extended to measure whether explanations also improve users' ability to correct model errors rather than just rank models.

Load-bearing premise

The automatic evaluation measures the same thing as human judgments of which explanations are useful for spotting the better QA model.

What would settle it

A follow-up human study in which annotators given LIME or attention explanations select the stronger model more often than those given input-perturbation explanations would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 1906.10924 by Alona Sydorova, Benjamin Roth, Nina Poerner.

**Figure 2.** Interface for the human annotation study.

Original abstract

Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of the model. For this purpose, we propose an automatic evaluation paradigm for explanation methods in the context of QA. We also conduct a study with human annotators to evaluate whether explanations help them identify better QA models. Our results suggest that IP provides better explanations than LIME or attention, according to both automatic and human evaluation. We obtain the same ranking of methods in both experiments, which supports the validity of our automatic evaluation paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript adapts LIME and input perturbation (IP) as post-hoc explanation methods for ML-based QA models over combined knowledge bases and text. It introduces an automatic evaluation paradigm for explanation quality in this QA setting, compares the methods against the model's built-in attention, and reports a human study in which annotators use explanations to identify stronger QA models. The central claim is that IP yields better explanations than LIME or attention according to both the automatic metric and human judgments, with identical method rankings taken as evidence that the automatic paradigm is valid.

Significance. If the reported alignment between automatic and human rankings is robust, the work supplies a concrete, reproducible way to rank explanation methods for KB+text QA without requiring new human studies for every model variant. The dual-evaluation design (automatic plus human) is a positive feature that directly addresses the usual difficulty of validating interpretability claims.

major comments (2)

[Abstract] Abstract: the statement that 'we obtain the same ranking of methods in both experiments' is presented without any statistical details (sample sizes, error bars, p-values, or inter-annotator agreement). Because the validity of the automatic paradigm rests entirely on this observed agreement, the absence of these quantities makes it impossible to judge whether the match is reliable or could arise by chance.
[Human study section] Human evaluation (the sole external anchor for the automatic metric): the paper uses annotator decisions about which QA model is better when given explanations as the ground truth for explanation quality. No information is supplied on the number of annotators, how ties or disagreements were handled, or any measure of annotation reliability. This directly affects both the superiority claim for IP and the claim that the automatic paradigm has been validated.
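
The chance-agreement concern in the first comment can be made concrete with a back-of-the-envelope calculation. As an assumption of this sketch (the paper does not state a null model), suppose that under the null hypothesis the automatic ranking of the three methods (IP, attention, LIME) is a uniformly random permutation, independent of the human ranking:

```latex
P(\text{identical ranking by chance}) = \frac{1}{3!} = \frac{1}{6} \approx 0.167
```

Under that assumption, a match on three methods alone cannot fall below a conventional 0.05 significance threshold, which is why per-method scores with error bars carry more evidential weight than the ordering itself.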

minor comments (1)

[Automatic evaluation paradigm] The precise formulation of the automatic fidelity/perturbation score for the QA setting (how answers are perturbed, how KB facts are masked, etc.) should be stated explicitly with a worked example so that the metric can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The points raised about statistical reporting are valid and we will revise the manuscript to include the requested details on sample sizes, error bars, p-values, annotator numbers, tie handling, and reliability measures.

Point-by-point responses

Referee: [Abstract] Abstract: the statement that 'we obtain the same ranking of methods in both experiments' is presented without any statistical details (sample sizes, error bars, p-values, or inter-annotator agreement). Because the validity of the automatic paradigm rests entirely on this observed agreement, the absence of these quantities makes it impossible to judge whether the match is reliable or could arise by chance.

Authors: We agree that the abstract would benefit from statistical details to support the ranking agreement claim. In the revised manuscript we will add the human study sample size, any variance or error measures from the evaluations, p-values testing the significance of the method ranking agreement (if the underlying data permit), and inter-annotator agreement statistics. These additions will allow readers to assess whether the observed alignment is robust.

Revision: yes

Referee: [Human study section] Human evaluation (the sole external anchor for the automatic metric): the paper uses annotator decisions about which QA model is better when given explanations as the ground truth for explanation quality. No information is supplied on the number of annotators, how ties or disagreements were handled, or any measure of annotation reliability. This directly affects both the superiority claim for IP and the claim that the automatic paradigm has been validated.

Authors: We concur that explicit reporting of the human evaluation protocol is necessary. The revision will specify the number of annotators, the method used to resolve ties or disagreements (e.g., majority vote), and a reliability metric such as percentage agreement or Fleiss' kappa. These details will strengthen both the claim that input perturbation yields superior explanations and the validation of the automatic evaluation paradigm.

Revision: yes
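
For readers unfamiliar with the reliability metric named in the response, the following is a small sketch of Fleiss' kappa using its standard definition. The function name `fleiss_kappa` and the rating table are hypothetical; the numbers are illustrative and are not data from the paper's human study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # Every rating is in one category: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Illustrative table: 4 comparison items, 3 annotators each,
# categories = ("model A is better", "model B is better").
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

A value of 1 means complete agreement, values near 0 mean agreement at chance level, and negative values mean agreement below chance.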

Circularity Check

0 steps flagged

No circularity; empirical comparisons rest on independent automatic and human evaluations

Full rationale

The paper adapts existing post-hoc methods (LIME, IP) and compares them to attention via a proposed automatic evaluation paradigm plus a separate human study. Rankings are derived from these external evaluations rather than from any model-internal fitted parameters, self-definitions, or self-citation chains. No equations or derivations reduce one quantity to another by construction; the automatic paradigm is presented as a new proposal whose validity is checked against human annotators, not presupposed. This is a standard empirical setup with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical ML study with no mathematical derivations or new theoretical constructs; it relies on standard assumptions underlying post-hoc explanation methods such as LIME.

axioms (1)

Domain assumption: the core assumptions of LIME (local linearity and feature independence in perturbations) transfer to the KB+text QA setting without modification.
The paper adapts LIME directly without additional validation of its assumptions for this domain.
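
The local-linearity assumption can be illustrated with a simplified LIME-style surrogate. This is a sketch under stated assumptions, not the paper's adaptation and not the `lime` package API: it enumerates all binary masks over the inputs instead of sampling them, weights each mask with an exponential kernel on the number of removed features, and fits a weighted linear model by solving the normal equations. The names `lime_style_weights`, `solve_linear_system`, and `toy_masked_model` are hypothetical.

```python
import itertools
import math
from typing import Callable, List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero-pivot handling).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_style_weights(
    model: Callable[[Sequence[int]], float],
    n_features: int,
    kernel_width: float = 2.0,
) -> List[float]:
    """Fit a proximity-weighted linear surrogate over on/off feature masks."""
    rows, targets, sample_weights = [], [], []
    for mask in itertools.product([0, 1], repeat=n_features):
        rows.append([1.0] + [float(v) for v in mask])  # intercept + mask
        targets.append(model(mask))
        removed = n_features - sum(mask)  # distance from the full input
        sample_weights.append(math.exp(-(removed ** 2) / kernel_width ** 2))
    d = n_features + 1
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y.
    xtwx = [
        [sum(w * r[i] * r[j] for w, r in zip(sample_weights, rows))
         for j in range(d)]
        for i in range(d)
    ]
    xtwy = [
        sum(w * r[i] * y for w, r, y in zip(sample_weights, rows, targets))
        for i in range(d)
    ]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # drop the intercept; one weight per input feature


def toy_masked_model(mask: Sequence[int]) -> float:
    """Hypothetical model that is exactly additive in the kept features."""
    return 0.125 + 0.5 * mask[0] + 0.25 * mask[1] + 0.0 * mask[2]


weights = lime_style_weights(toy_masked_model, n_features=3)
```

Because the toy model is exactly additive, the surrogate recovers its coefficients. When the features of a real QA model interact, a linear surrogate can only approximate its behaviour, which is the risk the axiom above points to.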

