Careful Selection of Knowledge to solve Open Book Question Answering

Arindam Mitra; Chitta Baral; Kuntal Kumar Pal; Pratyay Banerjee

arxiv: 1907.10738 · v1 · pith:NBSIJOVUnew · submitted 2019-07-24 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Pith reviewed 2026-05-24 16:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG

keywords: OpenBookQA · open book question answering · abductive information retrieval · information gain re-ranking · passage selection · language models · common knowledge reasoning

The pith

Combining language models with abductive retrieval, information-gain re-ranking, passage selection and weighted scoring reaches 72% accuracy on OpenBookQA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

OpenBookQA tests question answering that needs both linguistic parsing and reasoning over common knowledge facts supplied in an open book. The paper demonstrates that language models by themselves are insufficient for this task but that augmenting them with abductive information retrieval to locate candidate facts, followed by information-gain re-ranking, targeted passage selection and weighted scoring, produces 72.0 percent accuracy. This result improves the prior state of the art by 11.6 percentage points. A sympathetic reader would care because the work isolates the contribution of careful knowledge selection rather than model scale alone.

Core claim

The authors establish that state-of-the-art language models combined with abductive information retrieval, information-gain-based re-ranking, passage selection and weighted scoring achieve 72.0 percent accuracy on the OpenBookQA dataset, an improvement of 11.6 percentage points over the previous best result.
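
As a quick arithmetic check on the headline numbers, the sketch below derives the prior accuracy they imply and the corresponding relative gain. It assumes the 11.6 figure is an absolute gain in percentage points; the variable names are illustrative only and do not come from the paper.

```python
# Assumption: 11.6 is an absolute gain in percentage points, not a relative gain.
reported_accuracy = 72.0  # percent, as stated in the abstract
absolute_gain = 11.6      # percentage points, as stated in the abstract

# Prior accuracy implied by the two reported figures.
implied_prior = round(reported_accuracy - absolute_gain, 1)

# The same gain expressed relative to the implied prior accuracy.
relative_gain = round(100.0 * absolute_gain / implied_prior, 1)

print(f"implied prior accuracy: {implied_prior}%")
print(f"relative gain over prior: {relative_gain}%")
```

Under that assumption the absolute and relative gains differ noticeably, which is why the unit matters when the result is quoted.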

What carries the argument

Pipeline of abductive information retrieval followed by information-gain re-ranking, passage selection and weighted scoring to select and combine relevant facts from the open book.

If this is right

The same retrieval-plus-scoring steps can be added to other language-model QA systems that must incorporate external facts.
Information-gain re-ranking provides a concrete mechanism for preferring passages that reduce uncertainty about the answer.
Weighted scoring of multiple selected passages improves robustness when individual facts are only partially relevant.
The approach separates the contribution of knowledge selection from the base language model, allowing either component to be swapped.
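
The re-ranking and weighted-scoring ideas in the list above can be illustrated with a toy sketch. It treats information gain as the reduction in Shannon entropy over the answer options after reading a passage, then mixes the top passages' answer distributions using their retrieval scores as weights. This is an assumption-laden illustration, not the paper's formulation (which the text reviewed here does not give); the dictionary fields, the `top_k` parameter and all numbers are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posterior):
    """Entropy reduction over the answer options after reading a passage."""
    return entropy(prior) - entropy(posterior)

def rerank(passages, prior):
    """Order passages by information gain, highest first."""
    return sorted(
        passages,
        key=lambda p: information_gain(prior, p["posterior"]),
        reverse=True,
    )

def weighted_score(ranked, top_k=2):
    """Mix the top-k answer distributions, weighted by retrieval score."""
    chosen = ranked[:top_k]
    total = sum(p["retrieval_score"] for p in chosen)
    n_options = len(chosen[0]["posterior"])
    return [
        sum(p["retrieval_score"] * p["posterior"][i] for p in chosen) / total
        for i in range(n_options)
    ]

# Hypothetical toy data: four answer options, three candidate passages.
prior = [0.25, 0.25, 0.25, 0.25]
passages = [
    {"id": "f1", "retrieval_score": 0.9, "posterior": [0.25, 0.25, 0.25, 0.25]},
    {"id": "f2", "retrieval_score": 0.6, "posterior": [0.70, 0.10, 0.10, 0.10]},
    {"id": "f3", "retrieval_score": 0.3, "posterior": [0.40, 0.40, 0.10, 0.10]},
]

ranked = rerank(passages, prior)
combined = weighted_score(ranked, top_k=2)
best_option = max(range(len(combined)), key=combined.__getitem__)
print([p["id"] for p in ranked], best_option)
```

In this toy case the passage with the highest retrieval score (`f1`) is ranked last because it leaves the answer distribution unchanged, which is the behaviour the re-ranking bullet describes.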

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar retrieval pipelines could be embedded inside future language models so that the selection steps no longer require an external stage.
The method may transfer to other science-question datasets that supply background facts but still demand common-knowledge links.
If the information-gain criterion proves general, it offers a parameter-light alternative to learned re-rankers.

Load-bearing premise

The accuracy gain is produced by the listed retrieval and scoring steps rather than by hidden data leakage or post-hoc tuning that affects the reported numbers.

What would settle it

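Re-running the experiments on the same test set with the abductive retrieval, re-ranking or weighted scoring components removed would settle it. If accuracy stays close to 72 percent without them, the central claim is falsified; if it falls back toward the roughly 60 percent implied for the prior state of the art, the claim is supported.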
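
Such an ablation can be organised as a leave-one-out loop over the four named components. The sketch below is a hypothetical harness: the component names mirror the abstract, but `placeholder_evaluate` is a synthetic stand-in that returns made-up values, not measurements, and would need to be replaced by a real evaluation on the official test set.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Component names taken from the abstract; the identifiers are illustrative.
COMPONENTS = (
    "abductive_ir",
    "info_gain_rerank",
    "passage_selection",
    "weighted_scoring",
)

def leave_one_out(
    components: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Evaluate the full pipeline, then each variant with one component removed."""
    full = frozenset(components)
    results = {"full": evaluate(full)}
    for name in sorted(full):
        results[f"without_{name}"] = evaluate(full - {name})
    return results

def accuracy_drops(results: Dict[str, float]) -> Dict[str, float]:
    """Accuracy lost by removing each component, relative to the full pipeline."""
    return {k: results["full"] - v for k, v in results.items() if k != "full"}

def placeholder_evaluate(enabled: FrozenSet[str]) -> float:
    # Synthetic stand-in, NOT a measurement: swap in a real test-set evaluation.
    return 0.50 + 0.05 * len(enabled)

results = leave_one_out(COMPONENTS, placeholder_evaluate)
drops = accuracy_drops(results)
for name, drop in drops.items():
    print(f"{name}: drop of {drop:.2f}")
```

A drop near zero for a component would suggest it contributes little to the reported gain; large drops would point to where the improvement actually comes from.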

Original abstract

Open book question answering is a type of natural language based QA (NLQA) where questions are expected to be answered with respect to a given set of open book facts, and common knowledge about a topic. Recently a challenge involving such QA, OpenBookQA, has been proposed. Unlike most other NLQA tasks that focus on linguistic understanding, OpenBookQA requires deeper reasoning involving linguistic understanding as well as reasoning with common knowledge. In this paper we address QA with respect to the OpenBookQA dataset and combine state of the art language models with abductive information retrieval (IR), information gain based re-ranking, passage selection and weighted scoring to achieve 72.0% accuracy, an 11.6% improvement over the current state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims a solid 11.6-point lift on OpenBookQA by stacking abductive IR and information-gain re-ranking on language models, but the abstract gives no experimental protocol and the supervised re-ranking step carries an unaddressed leakage risk.

The headline result is 72% accuracy on OpenBookQA, an 11.6-point gain over prior work, obtained by combining language models with abductive information retrieval, information-gain re-ranking, passage selection, and weighted scoring. That numeric improvement is the only concrete new thing the abstract offers; the individual pieces are already in the literature. The paper does a reasonable job of framing why OpenBookQA needs both linguistic and common-knowledge reasoning and of naming the components they added on top of the base models. Beyond the number itself, there is little else that looks original or first-principles derived.

The soft spots are straightforward. The abstract supplies no ablation table, no error bars, no dataset-split description, and no pseudocode for how the information-gain scores or final weights were computed. Information gain is a supervised quantity, so the re-ranking step requires labeled examples. Nothing in the provided text states that this calculation was performed only on the training split or that test questions and answers were kept out of the gain computation and the weighting coefficients. The stress-test concern therefore lands: without that assurance the reported lift cannot be treated as a clean generalization result.

The work is aimed at researchers already running experiments on OpenBookQA or similar retrieval-heavy QA tasks. A reader in that narrow group might extract a useful baseline number if the full paper later supplies the missing protocol and confirms the split was respected. On current evidence the paper is not ready for a serious referee; the central claim cannot be evaluated without the experimental details. I would not bring it to a reading group or cite it until those details appear.

Referee Report

2 major / 1 minor

Summary. The paper claims to address open-book QA on the OpenBookQA dataset by combining state-of-the-art language models with abductive information retrieval, information-gain-based re-ranking, passage selection, and weighted scoring, achieving 72.0% accuracy (an 11.6% improvement over prior SOTA).

Significance. If the reported accuracy is reproducible without test-set leakage, the work would show that targeted knowledge selection and supervised re-ranking can meaningfully boost performance on a dataset that requires both linguistic parsing and common-knowledge reasoning. The combination of abductive IR with information-gain re-ranking is a concrete, testable recipe that could be adopted or ablated by others.

major comments (2)

[Abstract / method description] Abstract and method description: the 72.0% test accuracy is obtained via an information-gain re-ranking step whose computation is never stated to be restricted to the training split. Because information gain is a supervised quantity that requires labeled answers, any use of test questions or test labels during re-ranking or weighting would render the generalization claim invalid; no equation, pseudocode, or split statement is supplied that would allow a reader to confirm the split was respected.
[Abstract] Abstract: the central numeric claim (72.0% accuracy, 11.6% lift) is presented with no mention of train/dev/test splits, error bars, ablation results, or baseline implementation details, making it impossible to assess whether the improvement is supported by the experiments.
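
The split restriction raised in these comments can be made checkable in code. The sketch below is a minimal illustration under stated assumptions: a single mixing weight between a language-model score and a retrieval score stands in for whatever supervised re-ranking parameters the paper uses, it is fitted on the training split only, and a disjointness check guards against test items leaking in. All function names, fields and data are hypothetical.

```python
def predict(example, w):
    """Pick the option with the best mix of LM score and retrieval score."""
    scores = [
        w * lm + (1 - w) * ir
        for lm, ir in zip(example["lm_scores"], example["ir_scores"])
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(examples, w):
    """Fraction of examples answered correctly with mixing weight w."""
    return sum(predict(e, w) == e["label"] for e in examples) / len(examples)

def fit_weight(train_examples, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search the mixing weight using the training split only."""
    return max(candidates, key=lambda w: accuracy(train_examples, w))

def check_disjoint(train_examples, test_examples):
    """Raise if any question id appears in both splits."""
    overlap = {e["id"] for e in train_examples} & {e["id"] for e in test_examples}
    if overlap:
        raise ValueError(f"train/test leakage: {sorted(overlap)}")

# Hypothetical toy splits with two answer options per question.
train_split = [
    {"id": "q1", "lm_scores": [0.7, 0.3], "ir_scores": [0.1, 0.9], "label": 1},
    {"id": "q2", "lm_scores": [0.9, 0.1], "ir_scores": [0.4, 0.6], "label": 0},
]
test_split = [
    {"id": "q3", "lm_scores": [0.2, 0.8], "ir_scores": [0.3, 0.7], "label": 1},
]

check_disjoint(train_split, test_split)
w = fit_weight(train_split)          # test labels are never seen here
test_acc = accuracy(test_split, w)   # the test split is used once, for reporting
print(w, test_acc)
```

The point of the design is that `fit_weight` never receives the test split, so any supervised quantity it computes cannot depend on test labels.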

minor comments (1)

[Abstract] The abstract would be clearer if it named the prior SOTA system that is being improved by 11.6%.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying points where additional clarity is needed regarding data splits and experimental reporting. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract / method description] Abstract and method description: the 72.0% test accuracy is obtained via an information-gain re-ranking step whose computation is never stated to be restricted to the training split. Because information gain is a supervised quantity that requires labeled answers, any use of test questions or test labels during re-ranking or weighting would render the generalization claim invalid; no equation, pseudocode, or split statement is supplied that would allow a reader to confirm the split was respected.

Authors: We agree the manuscript does not explicitly document the split restriction for the information-gain step. In the reported experiments the information-gain scores and all supervised re-ranking weights were computed exclusively on the training split; test questions and labels were never used. We will insert an explicit statement, the relevant equation, and pseudocode in the revised method section to make this unambiguous.
Revision: yes

Referee: [Abstract] Abstract: the central numeric claim (72.0% accuracy, 11.6% lift) is presented with no mention of train/dev/test splits, error bars, ablation results, or baseline implementation details, making it impossible to assess whether the improvement is supported by the experiments.

Authors: The abstract is intentionally concise. The body of the paper uses the canonical OpenBookQA train/dev/test splits and reports the 72.0% figure on the official test set. We will expand the experimental section with error bars across multiple random seeds, complete ablation tables, and precise baseline re-implementation details. Because of length constraints we will not alter the abstract itself but will ensure the results section fully supports the headline numbers.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical accuracy report with no derivation chain

The paper reports an empirical result (72.0% accuracy via abductive IR + information-gain re-ranking + passage selection + weighted scoring on OpenBookQA) without any equations, fitted parameters presented as predictions, self-definitional steps, or load-bearing self-citations. The abstract and provided text contain no claimed first-principles derivation that reduces to the authors' own inputs by construction. Standard ML reporting of test-set performance on a public benchmark does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the ledger is therefore empty.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: combine state of the art language models with abductive information retrieval (IR), information gain based re-ranking, passage selection and weighted scoring

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Information Gain based Re-ranking to remove redundant information

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
cs.LG · 2026-05 · unverdicted · novelty 6.0

ADMM-Q is a new post-training quantization method using ADMM operator splitting that reduces WikiText-2 perplexity compared to GPTQ on Qwen3-8B across W3A16, W4A8, and W2A4KV4 settings.