Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations
Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.
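The three retrieval modes named in the abstract can be illustrated with a small, self-contained sketch. It does not use Pyserini's API: the toy corpus, the hand-written two-dimensional "embeddings", the function names, the BM25 parameter values, and the min-max-normalized linear interpolation used for fusion are all assumptions chosen for illustration. In practice the dense vectors would come from a transformer encoder and both searches would run against pre-built indexes.

```python
import math
from collections import Counter

# Toy corpus and hand-made 2-d "embeddings" (hypothetical, for illustration only).
DOCS = {
    "d1": "bm25 is a sparse bag of words ranking function",
    "d2": "dense retrieval encodes text with transformers",
    "d3": "hybrid retrieval combines sparse and dense scores",
}
DOC_VECS = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.6, 0.6]}


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: classic Okapi BM25 over bag-of-words representations."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: brute-force inner-product search over encoded vectors."""
    return {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }


def min_max(scores):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval: weighted sum of normalized sparse and dense scores."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    return {
        doc_id: alpha * sparse_n[doc_id] + (1 - alpha) * dense_n[doc_id]
        for doc_id in sparse
    }


def rank(scores):
    """Return document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


sparse = bm25_scores("sparse dense hybrid", DOCS)
dense = dense_scores([0.2, 0.9], DOC_VECS)  # hypothetical query embedding
hybrid = hybrid_scores(sparse, dense)
print("sparse:", rank(sparse))
print("dense: ", rank(dense))
print("hybrid:", rank(hybrid))
```

In this toy setup the sparse and dense rankings disagree on the top document, and the fused score arbitrates between them. The weight `alpha` is a tunable assumption of this sketch, not a value taken from the paper.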
Forward citations
Cited by 5 Pith papers
- BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination
  BracketRank reranks documents via LLM-driven bracket-style competitive elimination with mandatory reasoning explanations, reaching 26.56 nDCG@10 on BRIGHT and outperforming RankGPT-4 and Rank-R1-14B.
- Evaluating LLMs on Real-World Software Performance Optimization
  SWE-Pro benchmark shows LLMs deliver negligible runtime gains and almost no memory reductions on 102 real tasks where experts achieve 15.5x aggregate speedup and 171.3x peak memory reduction.
- SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics
  SPECTRA generates reproducible synthetic IR corpora up to 60,000 documents with controllable distractors, long-tail vocabulary, and graded relevance labels via a single-process Python prototype.
- BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
  BiCon-Gate improves dialogue fact-checking by applying staged de-colloquialisation and gating rewrites based on semantic consistency with context, yielding gains on the DialFact benchmark over baselines including LLM ...
- Mask-to-Correct$^+$: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction
  Mask-to-Correct and M2C+ use diversity-aware masking in RAG to identify erroneous claim spans and produce faithful corrections, outperforming baselines by up to 14% SARI without gold evidence.