
arxiv: 1807.09623 · v1 · pith:ZCVPS77Hnew · submitted 2018-07-25 · cs.CL · cs.AI · cs.LG

Repartitioning of the ComplexWebQuestions Dataset

classification: cs.CL, cs.AI, cs.LG
keywords: training, complexwebquestions, model, answer, berant, data, dataset, leakage

Recently, Talmor and Berant (2018) introduced ComplexWebQuestions, a dataset focused on answering complex questions by decomposing them into a sequence of simpler questions and extracting the answer from retrieved web snippets. In their work, the authors used a pre-trained reading comprehension (RC) model (Salant and Berant, 2018) to extract the answer from the web snippets. In this short note we show that training an RC model directly on the training data of ComplexWebQuestions reveals a leakage from the training set to the test set that makes it possible to obtain unreasonably high performance. As a solution, we construct a new partitioning of ComplexWebQuestions that does not suffer from this leakage and publicly release it. We also perform an empirical evaluation on these two datasets and show that training an RC model on the training data substantially improves state-of-the-art performance.
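The abstract does not spell out the mechanism of the leakage or how the new partitioning was built, so the following is only a generic sketch of one common remedy: splitting by group, so that examples derived from the same origin never land on both sides of the split. The field name `source_id`, the helper names `group_split` and `shared_groups`, and the toy data are all hypothetical and are not taken from the paper or from the released dataset.

```python
import random
from collections import defaultdict


def group_split(examples, group_key, test_fraction=0.2, seed=0):
    """Split examples so that no group appears in both train and test.

    `group_key` maps an example to a hashable group identifier
    (assumption: examples sharing a group are the ones that could leak).
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[group_key(ex)].append(ex)

    # Shuffle whole groups, not individual examples.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [ex for g in group_ids if g not in test_ids for ex in groups[g]]
    test = [ex for g in group_ids if g in test_ids for ex in groups[g]]
    return train, test


def shared_groups(train, test, group_key):
    """Return the group identifiers present in both splits (ideally empty)."""
    return {group_key(ex) for ex in train} & {group_key(ex) for ex in test}


# Toy data: 30 examples, three per hypothetical source group.
examples = [{"id": f"q{i}", "source_id": f"s{i // 3}"} for i in range(30)]
train, test = group_split(examples, lambda ex: ex["source_id"])
overlap = shared_groups(train, test, lambda ex: ex["source_id"])
```

Running `shared_groups` on an existing train/test split is a cheap first check for this kind of overlap; which key defines a group depends on how the dataset was generated and would have to be taken from the paper itself.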
