The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

Aleksandra Piktus; Barlas Oğuz; Dmytro Okhonko; Edouard Grave; Fabio Petroni; Gautier Izacard; Patrick Lewis; Samuel Broscheit; Sebastian Riedel; Vladimir Karpukhin

arXiv: 2112.09924 · v2 · pith:JA5WCLCOnew · submitted 2021-12-18 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

classification 💻 cs.CL cs.AI cs.IR cs.LG

keywords knowledge corpus sphere tasks background challenges common ki-nlp

To address the increasing demands of real-world applications, research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge-intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks that rely on knowledge, either factual or common sense, and ask systems to use a subset of CCNet, the Sphere corpus, as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state-of-the-art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.

This paper has not been read by Pith yet.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
cs.AI 2025-06 conditional novelty 6.0

Active Indexing with synthetic data augmentation for bidirectional fact-source binding during pretraining yields up to 30.2% higher citation precision than passive identifier appending on CitePretrainBench for Qwen models.

Corrective Retrieval Augmented Generation
cs.CL 2024-01 unverdicted novelty 6.0

CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generati...

Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

A Survey of Scaling in Large Language Model Reasoning
cs.AI 2025-04 unverdicted novelty 3.0

A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.