KILT: a Benchmark for Knowledge Intensive Language Tasks

Aleksandra Piktus; Angela Fan; Fabio Petroni; James Thorne; Jean Maillard; Majid Yazdani; Nicola De Cao; Patrick Lewis; Sebastian Riedel; Tim Rocktäschel

arxiv: 2009.02252 · v4 · pith:DXWWQXX2new · submitted 2020-09-04 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

keywords: kilt, models, tasks, knowledge, addition, answering, benchmark, checking

Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
cs.AI 2026-05 unverdicted novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
cs.CL 2025-04 conditional novelty 7.0

BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
cs.AI 2026-06 unverdicted novelty 6.0

UMG-RAG improves long-document RAG by uncertainty-aware fusion of multi-granularity retrievals from complementary dense and sparse retrievers, plus a parent-promotion variant.

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation
cs.CL 2026-06 unverdicted novelty 6.0

HieraRAG shows optimal RAG benchmark granularity varies by dimension, with complexity favoring fine-grained categories and a new Coherence Ratio measuring category structure.

ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.