Let's measure run time! Extending the IR replicability infrastructure to include performance aspects

Allan Hanbury; Sebastian Hofstätter

arxiv: 1907.04614 · v1 · pith:LWG2ZR37new · submitted 2019-07-10 · 💻 cs.IR

Pith reviewed 2026-05-24 23:34 UTC · model grok-4.3

classification 💻 cs.IR

keywords information retrieval · neural IR · replicability · runtime measurement · query latency · re-ranking models · benchmark scenarios · docker infrastructure

The pith

Extending the docker-based replicability infrastructure with two performance benchmark scenarios enables consistent measurement of query run times for neural IR systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that docker-based replicability setups create an opportunity to track run times in addition to effectiveness scores. Neural re-ranking models face a clear speed-effectiveness trade-off, where complex models like BERT deliver high accuracy at the cost of much higher latency than simpler alternatives. Including runtime in standardized evaluations would push the community to weigh practical query response times when proposing new models. The authors support the argument with a case study of model run times and outline two concrete benchmark scenarios for the existing infrastructure.

Core claim

The authors state that user satisfaction depends on the time required to present query results, and that recent neural IR advances bring latency issues to the forefront through a complex trade-off involving encoding models, network architecture, and hardware. They argue that extending the replicability infrastructure with performance-focused scenarios will broaden community focus to sustain practical applicability of innovations.

What carries the argument

Two performance-focused benchmark scenarios added to the existing docker-based replicability infrastructure for measuring and comparing run times.

If this is right

Run times of different neural re-ranking models can be measured and compared in a replicable way.
The impact of choices like network architecture or hardware acceleration on latency becomes quantifiable.
Simpler models may be favored when both effectiveness and speed are evaluated together.
Query latency will be treated as a first-class concern in model development.
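
One way to make the joint view of effectiveness and speed concrete is to keep only the models that no other model beats on both axes (a Pareto front). The sketch below is an illustration, not something the paper specifies: the `ModelResult` type, the model names, and every number are hypothetical placeholders rather than measurements from the case study.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    effectiveness: float  # higher is better (for example, a ranking metric)
    latency_ms: float     # lower is better


def pareto_front(results: List[ModelResult]) -> List[ModelResult]:
    """Return the models that are not dominated on both effectiveness and latency."""
    front = []
    for cand in results:
        # A model is dominated if another one is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            other.effectiveness >= cand.effectiveness
            and other.latency_ms <= cand.latency_ms
            and (other.effectiveness > cand.effectiveness
                 or other.latency_ms < cand.latency_ms)
            for other in results
        )
        if not dominated:
            front.append(cand)
    return sorted(front, key=lambda r: r.latency_ms)


# Placeholder values for illustration only; NOT results from the paper.
example = [
    ModelResult("model_a", 0.20, 5.0),
    ModelResult("model_b", 0.25, 20.0),
    ModelResult("model_c", 0.24, 40.0),   # slower and less effective than model_b
    ModelResult("model_d", 0.35, 900.0),
]

for result in pareto_front(example):
    print(result.name, result.effectiveness, result.latency_ms)
```

With these placeholder inputs, `model_c` drops out because `model_b` is both faster and more effective; the slow but most effective `model_d` stays on the front, which is the kind of trade-off a combined evaluation would expose.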

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar performance tracking could be added to replicability efforts in other machine learning domains that rely on docker containers.
The scenarios might encourage development of hybrid models that optimize both accuracy and speed explicitly.
Adoption could shift community norms so that latency numbers become expected in neural IR publications.

Load-bearing premise

That standardizing runtime measurements inside the replicability infrastructure will cause researchers to prioritize performance considerations alongside effectiveness.

What would settle it

After the scenarios are added, check whether papers using the infrastructure begin reporting and optimizing for measured run times rather than effectiveness alone.

Figures

Figures reproduced from arXiv: 1907.04614 by Allan Hanbury, Sebastian Hofstätter.

Figure 2: A simplified query workflow with re-ranking – showing the reach of our proposed performance benchmarks

Original abstract

Establishing a docker-based replicability infrastructure offers the community a great opportunity: measuring the run time of information retrieval systems. The time required to present query results to a user is paramount to the user's satisfaction. Recent advances in neural IR re-ranking models put the issue of query latency at the forefront. They bring a complex trade-off between performance and effectiveness based on a myriad of factors: the choice of encoding model, network architecture, hardware acceleration and many others. The best performing models (currently using the BERT transformer model) run orders of magnitude more slowly than simpler architectures. We aim to broaden the focus of the neural IR community to include performance considerations -- to sustain the practical applicability of our innovations. In this position paper we supply our argument with a case study exploring the performance of different neural re-ranking models. Finally, we propose to extend the OSIRRC docker-based replicability infrastructure with two performance-focused benchmark scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A modest proposal to add runtime tracking to OSIRRC, illustrated by a neural re-ranking case study but without strong evidence it will shift priorities.

The paper's core move is to extend the existing OSIRRC Docker replicability setup with two new benchmark scenarios that capture runtime alongside effectiveness. They back this with a case study on latency differences across neural re-ranking models, noting that BERT-based ones are much slower than simpler alternatives and that this matters for user experience. That framing is reasonable and directly tied to a real tension in recent neural IR work.

The suggestion itself is incremental but practical: reuse the container infrastructure rather than build something new from scratch. Credit to the authors for identifying the gap and offering a concrete next step instead of just complaining about it. The case study is the part that could make the argument land, assuming it shows measurable differences that matter in practice. The manuscript does not overclaim; it presents the extension as an enabling change rather than a guaranteed fix.

The main limitation is that this remains a position paper. The assumption that standardizing runtime numbers will cause the community to weigh performance more heavily is plausible but untested here, and the case study details are only referenced, not walked through with numbers or setup specifics. No circular reasoning or invented metrics appear in the argument. The proposal does not require new theory or hardware claims that would need extra validation.

Readers already involved with OSIRRC or running neural re-rankers would get the most out of it, as it gives them a ready path to add latency data to their replications. Pure effectiveness papers or work outside neural IR would find little to use. It is worth sending for peer review because the infrastructure change is well-scoped, the motivation is grounded in current model behavior, and referees could usefully comment on the exact benchmark definitions before any implementation effort begins.

Referee Report

2 major / 0 minor

Summary. The manuscript is a position paper arguing that runtime measurement is important for neural IR systems due to latency-effectiveness trade-offs (e.g., BERT-based re-rankers). It proposes extending the existing OSIRRC Docker replicability infrastructure with two performance-focused benchmark scenarios and supplies supporting evidence via a case study on neural re-ranking model performance.

Significance. If the proposed infrastructure extension is adopted and the case study demonstrates clear latency differences, the work could help shift community norms toward joint consideration of effectiveness and efficiency, improving the practical applicability of neural IR innovations. The concrete proposal to reuse an existing Docker framework is a strength.

major comments (2)

[proposal paragraph] The two performance-focused benchmark scenarios are referenced in the proposal but never defined (what metrics, integration points with OSIRRC, hardware assumptions, or output formats). This detail is load-bearing for the central claim that the extension is feasible and useful.
[case study] The case study on neural re-ranking latency is invoked to support the argument but is not described (models, hardware, measurement protocol, or quantitative results). Without these details the manuscript cannot demonstrate that the proposed scenarios would address a real community need.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and constructive feedback on our position paper. We address the two major comments below, agreeing that additional detail is warranted to strengthen the proposal.

Referee: [proposal paragraph] The two performance-focused benchmark scenarios are referenced in the proposal but never defined (what metrics, integration points with OSIRRC, hardware assumptions, or output formats). This detail is load-bearing for the central claim that the extension is feasible and useful.

Authors: We agree that the manuscript references the two scenarios at a high level without providing the requested specifications. As this is a position paper, the original intent was to outline the conceptual extension rather than deliver a complete specification. However, we recognize that concrete details would better demonstrate feasibility. In the revised manuscript we will expand the proposal section to define the scenarios, including suggested metrics (latency and throughput), integration points with the existing OSIRRC Docker framework, hardware assumptions, and compatible output formats.

revision: yes

Referee: [case study] The case study on neural re-ranking latency is invoked to support the argument but is not described (models, hardware, measurement protocol, or quantitative results). Without these details the manuscript cannot demonstrate that the proposed scenarios would address a real community need.

Authors: The case study is referenced to illustrate the latency-effectiveness trade-offs in neural re-ranking, but we acknowledge that the current text does not supply the requested specifics on models, hardware, protocol, or results. To more convincingly show that the proposed benchmarks address a community need, we will revise the manuscript to include a fuller description of the case study with these details.

revision: yes
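
The latency and throughput metrics named in the first response could be collected with a small timing harness along the following lines. This is a minimal sketch under stated assumptions, not the authors' protocol and not part of OSIRRC: `measure_latency`, `toy_rerank`, the warm-up count, and the choice of a nearest-rank 95th percentile are all illustrative choices introduced here.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency(rerank: Callable[[str], object],
                    queries: Sequence[str],
                    warmup: int = 1) -> Dict[str, float]:
    """Time a re-ranking callable once per query and summarise the timings."""
    if not queries:
        raise ValueError("need at least one query")

    # Warm-up calls are not timed, so one-off costs (model loading, caches)
    # do not inflate the reported per-query latency.
    for query in queries[:warmup]:
        rerank(query)

    latencies_ms: List[float] = []
    run_start = time.perf_counter()
    for query in queries:
        start = time.perf_counter()
        rerank(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    total_s = time.perf_counter() - run_start

    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile over the measured queries.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    return {
        "queries": float(len(ordered)),
        "mean_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": p95,
        "throughput_qps": len(ordered) / total_s if total_s > 0 else float("inf"),
    }


def toy_rerank(query: str) -> list:
    """Hypothetical stand-in for a real re-ranker; it only sleeps briefly."""
    time.sleep(0.002)
    return []


stats = measure_latency(toy_rerank, ["q1", "q2", "q3", "q4"])
print(stats)
```

Reporting a tail percentile next to the mean is a deliberate choice in this sketch: a handful of slow queries can dominate perceived responsiveness while barely moving the average. Timings taken this way also depend on the hardware and container configuration, which is why the referee asks for those assumptions to be stated.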

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a position paper advancing a concrete proposal to add two performance-focused benchmark scenarios to the existing OSIRRC Docker infrastructure, supported by a descriptive case study on neural re-ranking latency. It advances no derivations, equations, fitted parameters, or first-principles predictions whose outputs reduce to their inputs by construction. No self-citations function as load-bearing premises for any technical claim, and the aspirational statement about future community priorities is presented as motivation rather than a precondition or derived result. The argument is therefore self-contained as an infrastructure-extension proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on the domain assumption that runtime is a primary user-satisfaction factor and that current neural IR work overlooks it; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The time required to present query results to a user is paramount to the user's satisfaction.
Stated directly in the abstract as the motivation for measuring runtime.
