
arxiv: 2112.09924 · v2 · pith:JA5WCLCOnew · submitted 2021-12-18 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords: knowledge, corpus, sphere, tasks, background, challenges, common, ki-nlp

To address the increasing demands of real-world applications, research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge-intensive tasks in which we generalize the background corpus to a universal web snapshot. We investigate a slate of NLP tasks which rely on knowledge, either factual or common sense, and ask systems to use a subset of CCNet, the Sphere corpus, as a knowledge source. In contrast to Wikipedia, otherwise a common background corpus in KI-NLP, Sphere is orders of magnitude larger and better reflects the full diversity of knowledge on the web. Despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, we find that retrieval from Sphere enables a state-of-the-art system to match and even outperform Wikipedia-based models on several tasks. We also observe that while a dense index can outperform a sparse BM25 baseline on Wikipedia, on Sphere this is not yet possible. To facilitate further research and minimise the community's reliance on proprietary, black-box search engines, we share our indices, evaluation metrics and infrastructure.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

    cs.AI 2025-06 conditional novelty 6.0

    Active Indexing with synthetic data augmentation for bidirectional fact-source binding during pretraining yields up to 30.2% higher citation precision than passive identifier appending on CitePretrainBench for Qwen models.

  2. Corrective Retrieval Augmented Generation

    cs.CL 2024-01 unverdicted novelty 6.0

    CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generati...

  3. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL 2022-08 unverdicted novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  4. A Survey of Scaling in Large Language Model Reasoning

    cs.AI 2025-04 unverdicted novelty 3.0

    A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.