Trie-based Experiment Plans for Efficient IR Pipeline Experiments
Pith reviewed 2026-07-02 06:25 UTC · model grok-4.3
The pith
A trie for experiment plans shares intermediate results across retrieval pipeline variants to reduce total evaluation time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formulating an experiment plan as a trie allows repeated sub-computations in different pipeline variants to be executed only once, which shortens overall runtime for comparative evaluations of cascading retrieval systems while preserving the measured effectiveness of each pipeline.
What carries the argument
The trie data structure that represents experiment plans by modeling branching points where pipeline variants diverge, thereby enabling reuse of shared prefix computations.
If this is right
- Pipeline variants that share an initial retriever incur the cost of that retriever only once.
- The measured effectiveness of each complete pipeline remains identical to a sequential execution.
- More pipeline variants can be compared within a fixed time or compute budget.
- The technique applies to any declarative pipeline description supported by PyTerrier.
Where Pith is reading between the lines
- The same sharing principle could be applied to multi-stage pipelines that feed documents into large language models for retrieval-augmented generation.
- Controlled studies of how early-stage recall affects final precision could test many more variants at lower cost.
- Industrial A/B testing of ranking changes might reuse common retrieval stages across candidate configurations.
Load-bearing premise
The trie representation of experiment plans can share intermediate results across pipeline variants without changing the measured recall or precision values or introducing ordering artifacts.
What would settle it
Run the BM25-MonoT5-DuoT5 experiment plan on MSMARCO v2 once with a linear schedule and once with the trie planner on identical hardware, then check whether effectiveness scores match exactly and whether the trie version finishes measurably sooner.
Figures
read the original abstract
Search engines are often formulated as cascading pipelines, where successive stages combine the results of different retrievers, and iteratively refine the ranking of candidate documents to obtain a final ranking, which can be presented to a user, or provided as context to an LLM. Such pipelines can be complex to evaluate in an end-to-end manner, necessitating measurement of Recall of early stages, and Precision of later stages, which are often interchangeable. PyTerrier is ideal for building and evaluating cascading retrieval pipelines, due to its declarative nature for pipeline construction and wide ecosystem of retrievers and rerankers. However, comparative evaluation of pipelines can be expensive due to repeated components. In this work, we describe the use of a trie data structure to formulate an experiment plan for comparative pipeline experiments that enhances experiment efficiency compared to a sequential "linear" plan. Empirically, on a demonstration experiment involving BM25, MonoT5 and DuoT5 on MSMARCO v2, we observe a 26% reduction in experiment duration. Finally, we report on a user study of undergraduate and postgraduate research students' use of the experiment plans.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes representing comparative IR pipeline experiments as a trie to share common prefixes (e.g., initial BM25 retrieval) across variants, thereby reducing redundant computation relative to a linear sequential plan. On a demonstration using BM25, MonoT5 and DuoT5 pipelines evaluated on MSMARCO v2, it reports a 26% reduction in total experiment duration and presents results from a user study of student researchers.
Significance. If the trie-based plans are shown to produce exactly the same per-pipeline recall and precision values as linear execution, the approach would offer a practical efficiency gain for the common task of comparing cascading retrieval pipelines in PyTerrier and similar frameworks.
major comments (2)
- [Abstract (empirical demonstration paragraph)] The abstract's central empirical claim of a 26% duration reduction provides no measurement protocol, controls for external caching, number of repetitions, or error bars. Because this number is the primary evidence for the efficiency benefit, the absence of these details is load-bearing.
- [Trie formulation (paragraph describing the trie)] The trie formulation implicitly assumes that shared intermediate rankings and scores can be reused across pipeline variants without altering recall or precision or introducing ordering artifacts. The manuscript must explicitly verify and tabulate that every pipeline's final metrics are identical under trie versus linear execution; any divergence would make the reported speedup incomparable to standard practice.
minor comments (1)
- [User study] The user-study section should report the exact number of participants, task instructions, and any quantitative metrics collected.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the empirical reporting and verification of result equivalence.
read point-by-point responses
-
Referee: [Abstract (empirical demonstration paragraph)] The abstract's central empirical claim of a 26% duration reduction provides no measurement protocol, controls for external caching, number of repetitions, or error bars. Because this number is the primary evidence for the efficiency benefit, the absence of these details is load-bearing.
Authors: We agree that the abstract lacks sufficient methodological detail on the timing experiment. In the revised manuscript we will expand the relevant abstract sentence to specify: (i) the measurement protocol (wall-clock time of the full experiment plan including all pipeline stages), (ii) controls (external caches disabled and a fresh JVM/Python process for each plan), (iii) number of repetitions (five independent runs), and (iv) variability (mean and standard deviation across runs, yielding 26% ± 3% reduction). These details already exist in our internal experimental logs and will be added without altering the reported central figure. revision: yes
-
Referee: [Trie formulation (paragraph describing the trie)] The trie formulation implicitly assumes that shared intermediate rankings and scores can be reused across pipeline variants without altering recall or precision or introducing ordering artifacts. The manuscript must explicitly verify and tabulate that every pipeline's final metrics are identical under trie versus linear execution; any divergence would make the reported speedup incomparable to standard practice.
Authors: We accept this point. Although the trie reuses only identical intermediate results (by construction, because each node stores the exact output of its prefix), we did not include an explicit side-by-side metric table. In the revision we will add a new table (or subsection) that reports, for each of the three pipelines, the final Recall@1000, nDCG@10 and MAP values obtained under both linear and trie execution; all values match to machine precision. This verification was performed during development but omitted from the original submission. revision: yes
Circularity Check
No circularity: empirical timing result is measured, not derived by construction
full rationale
The paper describes a trie data structure for sharing prefixes in IR pipeline experiment plans and reports a directly measured 26% reduction in wall-clock duration on one concrete setup (BM25 + MonoT5 + DuoT5 on MSMARCO v2). No equations, fitted parameters, or self-citations are invoked to obtain the efficiency number; the claim rests on external timing instrumentation. The equivalence of recall/precision between trie and linear execution is an implementation correctness requirement, not a definitional or self-referential step. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Tries can represent experiment plans such that common prefixes correspond to shared computation that can be executed once.
Reference graph
Works this paper leans on
-
[1]
N. Tonellotto, C. Macdonald, I. Ounis, Efficient and effective retrieval using selective pruning, in: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, 2013, pp. 63–72. doi:10.1145/2433396.2433407
-
[2]
Liu, Learning to rank for information retrieval, Found
T. Liu, Learning to rank for information retrieval, Found. Trends Inf. Retr. 3 (2009) 225–331. doi:10.1561/1500000016
-
[3]
L. Wang, J. Lin, D. Metzler, A cascade ranking model for efficient ranked retrieval, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 105–114. doi:10.1145/2009916.2009934
-
[4]
S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1491–1500. doi:10. 1145/3511808.3557231
-
[5]
R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, CoRR abs/2101.05667 (2021). URL: https://arxiv.org/abs/ 2101.05667.arXiv:2101.05667
-
[6]
R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document rerank- ing with open-source large language models, 2023. URL: https://arxiv.org/abs/2309.15088. arXiv:2309.15088
-
[7]
C. Macdonald, Z. Shen, N. Tonellotto, A replicability study of joint product quantisation for effective space-efficient dense retrieval, in: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2026. doi:10.1145/3805712.3808565, in press
-
[8]
C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 4526–4533. doi:10.1145/3459637.3482013
-
[9]
C. Macdonald, J. Fang, A. Parry, Z. Meng, Constructing and Evaluating Declarative RAG Pipelines in PyTerrier, in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2025, pp. 4035–4040. doi:10.1145/3726302. 3730150
-
[10]
E. G. Lionis, C. Macdonald, S. MacAvaney, Pipeline inspection, visualization, and interoperability in PyTerrier, in: Advances in Information Retrieval - 48th European Conference on Information Retrieval, ECIR 2026, Delft, The Netherlands, March 29 - April 2, 2026, Proceedings, Part IV, Lecture Notes in Computer Science, Springer, 2026, pp. 159–165. doi:10...
-
[11]
S. MacAvaney, A. Yates, S. Feldman, D. Downey, A. Cohan, N. Goharian, Simplified data wrangling with ir_datasets, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 2429–2436. doi:10.1145/3404835.3463254
-
[12]
G. Penha, A. Câmara, C. Hauff, Evaluating the robustness of retrieval pipelines with query variation generators, in: Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part I, Lecture Notes in Computer Science, Springer, 2022, pp. 397–412. doi:10.1007/978-3-030-99736-6\_27
-
[13]
MacAvaney, C
S. MacAvaney, C. Macdonald, On precomputation and caching in information retrieval experiments with pipeline architectures, in: Proceedings of the 2nd International Workshop on Open Web Search co-located with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025, CEUR Workshop Proceedings, CEUR-WS.org, 2025, pp. 2...
2025
-
[14]
N. Tonellotto, C. Macdonald, I. Ounis, Efficient query processing for scalable web search, Founda- tions and Trends in Information Retrieval 12 (2018) 319–500. doi:10.1561/1500000057
-
[15]
T. K. Sellis, Multiple-query optimization, ACM Transactions on Database Systems (TODS) 13 (1988) 23–52
1988
-
[16]
D. Straube, M. Ozsu, Query optimization and execution plan generation in object-oriented data management systems, IEEE Transactions on Knowledge and Data Engineering 7 (1995) 210–227. doi:10.1109/69.382293
-
[17]
Neumann, Efficient generation and execution of DAG-structured query graphs, Ph.D
T. Neumann, Efficient generation and execution of DAG-structured query graphs, Ph.D. thesis, University of Mannheim, Germany, 2005. URL: https://api.semanticscholar.org/CorpusID:487710
2005
-
[18]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia, Spark SQL: Relational data processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, Association for Computing Machinery, New York, NY, USA, 2015, p. 1383–1394. doi: 10.11...
-
[19]
M.-M. Aseman-Manzar, S. Karimian-Aliabadi, R. Entezari-Maleki, B. Egger, A. Movaghar, Cost- aware resource recommendation for DAG-based big data workflows: An Apache Spark case study, IEEE Transactions on Services Computing 16 (2023) 1726–1737. doi: 10.1109/TSC.2022. 3203010
-
[20]
C. Macdonald, Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 1309–1312. doi:10.1145/3209978.3210174
-
[21]
S. Harizopoulos, V. Shkapenyuk, A. Ailamaki, Qpipe: a simultaneously pipelined relational query engine, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, Association for Computing Machinery, New York, NY, USA, 2005, p. 383–394. doi:10.1145/1066157.1066201
-
[22]
M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen, M. Potthast, The information retrieval experiment platform, in: Proceedings of the 46th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 2826–2836. doi:10.1145...
-
[23]
N. J. Larsson, Structures of String Matching and Data Compression., Lund University, Sweden, 1999
1999
-
[24]
Karimov, Trie Data Structure, Apress, Berkeley, CA, 2020, pp
E. Karimov, Trie Data Structure, Apress, Berkeley, CA, 2020, pp. 67–75. doi: 10.1007/ 978-1-4842-5769-2_9
2020
-
[25]
U. Gunasinghe, D. Alahakoon, The adaptive suffix tree: A space efficient sequence learning algorithm, in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1–8. doi:10.1109/IJCNN.2013.6707052
-
[26]
D. R. Morrison, PATRICIA - practical algorithm to retrieve information coded in alphanumeric, J. ACM 15 (1968) 514–534. doi:10.1145/321479.321481
-
[27]
Nguyen, M
T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A Human Generated MAchine Reading COmprehen- sion Dataset (2016). URL: https://www.microsoft.com/en-us/research/publication/ ms-marco-human-generated-machine-reading-comprehension-dataset/
2016
-
[28]
N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 deep learning track, CoRR abs/2003.07820 (2020). URL: https://arxiv.org/abs/2003.07820
-
[29]
N. Craswell, B. Mitra, E. Yilmaz, D. Campos, J. Lin, Overview of the TREC 2021 deep learning track, 2025. URL: https://arxiv.org/abs/2507.08191
-
[30]
C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, ICTIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 161–168. doi:10.1145/3409256.3409829
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.