pith. sign in

arxiv: 2607.01162 · v1 · pith:SZYHAAY7new · submitted 2026-07-01 · 💻 cs.IR

Trie-based Experiment Plans for Efficient IR Pipeline Experiments

Pith reviewed 2026-07-02 06:25 UTC · model grok-4.3

classification 💻 cs.IR
keywords information retrievalpipeline evaluationtrie data structureexperiment efficiencycascading pipelinesPyTerrierMSMARCOcomparative experiments
0
0 comments X

The pith

A trie for experiment plans shares intermediate results across retrieval pipeline variants to reduce total evaluation time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes encoding comparative experiments on cascading retrieval pipelines as a trie rather than a linear sequence of runs. This structure identifies and reuses outputs from shared early stages, such as an initial retriever, when different later rerankers are tested. The approach works with the declarative pipeline syntax in PyTerrier and leaves the recorded recall and precision values unchanged. In a concrete test with BM25 followed by MonoT5 and DuoT5 on MSMARCO v2, the trie plan finished 26 percent faster than the sequential alternative. The authors also describe how the method was received by student users.

Core claim

Formulating an experiment plan as a trie allows repeated sub-computations in different pipeline variants to be executed only once, which shortens overall runtime for comparative evaluations of cascading retrieval systems while preserving the measured effectiveness of each pipeline.

What carries the argument

The trie data structure that represents experiment plans by modeling branching points where pipeline variants diverge, thereby enabling reuse of shared prefix computations.

If this is right

  • Pipeline variants that share an initial retriever incur the cost of that retriever only once.
  • The measured effectiveness of each complete pipeline remains identical to a sequential execution.
  • More pipeline variants can be compared within a fixed time or compute budget.
  • The technique applies to any declarative pipeline description supported by PyTerrier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sharing principle could be applied to multi-stage pipelines that feed documents into large language models for retrieval-augmented generation.
  • Controlled studies of how early-stage recall affects final precision could test many more variants at lower cost.
  • Industrial A/B testing of ranking changes might reuse common retrieval stages across candidate configurations.

Load-bearing premise

The trie representation of experiment plans can share intermediate results across pipeline variants without changing the measured recall or precision values or introducing ordering artifacts.

What would settle it

Run the BM25-MonoT5-DuoT5 experiment plan on MSMARCO v2 once with a linear schedule and once with the trie planner on identical hardware, then check whether effectiveness scores match exactly and whether the trie version finishes measurably sooner.

Figures

Figures reproduced from arXiv: 2607.01162 by Craig Macdonald, Irene Anu.

Figure 1
Figure 1. Figure 1: PyTerrier generated schematic [10] visualisation of the pipeline from Listing 1. once and its output is broadcast to all consuming operators. Our work is conceptually aligned with this idea, where we also exploit shared computation across multiple query pipelines. However, instead of sharing relational operators at runtime, we identify prefix overlaps across IR pipelines and execute these shared stages onc… view at source ↗
Figure 2
Figure 2. Figure 2: Structural comparison between execution DAG and radix tree implementation derived from the pipelines in Listing 2. 4.1. Definition Let 𝒫 be a finite set of transformer pipelines. Each pipeline 𝑃𝑖 ∈ 𝒫 is an ordered sequence of stages 𝑃𝑖,1 ≫ 𝑃𝑖,2 ≫ · · · ≫ 𝑃𝑖,‖𝑃𝑖‖ , where ‖𝑃𝑖‖ denotes the number of stages of the pipeline and 𝑃𝑖,𝑙 its 𝑙-th stage. For 0 ≤ 𝑙 ≤ ‖𝑃𝑖‖, let pref (𝑃𝑖 , 𝑙) denote the prefix consistin… view at source ↗
Figure 3
Figure 3. Figure 3: Visualisation of experiment execution progress. This visualisation is shown in notebook environments whenever plan=’tree’ is set. stops, even if subsets of pipelines share additional prefixes deeper in their structure. A trie is instead able to identify every shared prefix, not just the global one. Whenever a set of pipelines share an initial subsequence, the radix trie represents it as a single node. This… view at source ↗
read the original abstract

Search engines are often formulated as cascading pipelines, where successive stages combine the results of different retrievers, and iteratively refine the ranking of candidate documents to obtain a final ranking, which can be presented to a user, or provided as context to an LLM. Such pipelines can be complex to evaluate in an end-to-end manner, necessitating measurement of Recall of early stages, and Precision of later stages, which are often interchangeable. PyTerrier is ideal for building and evaluating cascading retrieval pipelines, due to its declarative nature for pipeline construction and wide ecosystem of retrievers and rerankers. However, comparative evaluation of pipelines can be expensive due to repeated components. In this work, we describe the use of a trie data structure to formulate an experiment plan for comparative pipeline experiments that enhances experiment efficiency compared to a sequential "linear" plan. Empirically, on a demonstration experiment involving BM25, MonoT5 and DuoT5 on MSMARCO v2, we observe a 26% reduction in experiment duration. Finally, we report on a user study of undergraduate and postgraduate research students' use of the experiment plans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes representing comparative IR pipeline experiments as a trie to share common prefixes (e.g., initial BM25 retrieval) across variants, thereby reducing redundant computation relative to a linear sequential plan. On a demonstration using BM25, MonoT5 and DuoT5 pipelines evaluated on MSMARCO v2, it reports a 26% reduction in total experiment duration and presents results from a user study of student researchers.

Significance. If the trie-based plans are shown to produce exactly the same per-pipeline recall and precision values as linear execution, the approach would offer a practical efficiency gain for the common task of comparing cascading retrieval pipelines in PyTerrier and similar frameworks.

major comments (2)
  1. [Abstract (empirical demonstration paragraph)] The abstract's central empirical claim of a 26% duration reduction provides no measurement protocol, controls for external caching, number of repetitions, or error bars. Because this number is the primary evidence for the efficiency benefit, the absence of these details is load-bearing.
  2. [Trie formulation (paragraph describing the trie)] The trie formulation implicitly assumes that shared intermediate rankings and scores can be reused across pipeline variants without altering recall or precision or introducing ordering artifacts. The manuscript must explicitly verify and tabulate that every pipeline's final metrics are identical under trie versus linear execution; any divergence would make the reported speedup incomparable to standard practice.
minor comments (1)
  1. [User study] The user-study section should report the exact number of participants, task instructions, and any quantitative metrics collected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the empirical reporting and verification of result equivalence.

read point-by-point responses
  1. Referee: [Abstract (empirical demonstration paragraph)] The abstract's central empirical claim of a 26% duration reduction provides no measurement protocol, controls for external caching, number of repetitions, or error bars. Because this number is the primary evidence for the efficiency benefit, the absence of these details is load-bearing.

    Authors: We agree that the abstract lacks sufficient methodological detail on the timing experiment. In the revised manuscript we will expand the relevant abstract sentence to specify: (i) the measurement protocol (wall-clock time of the full experiment plan including all pipeline stages), (ii) controls (external caches disabled and a fresh JVM/Python process for each plan), (iii) number of repetitions (five independent runs), and (iv) variability (mean and standard deviation across runs, yielding 26% ± 3% reduction). These details already exist in our internal experimental logs and will be added without altering the reported central figure. revision: yes

  2. Referee: [Trie formulation (paragraph describing the trie)] The trie formulation implicitly assumes that shared intermediate rankings and scores can be reused across pipeline variants without altering recall or precision or introducing ordering artifacts. The manuscript must explicitly verify and tabulate that every pipeline's final metrics are identical under trie versus linear execution; any divergence would make the reported speedup incomparable to standard practice.

    Authors: We accept this point. Although the trie reuses only identical intermediate results (by construction, because each node stores the exact output of its prefix), we did not include an explicit side-by-side metric table. In the revision we will add a new table (or subsection) that reports, for each of the three pipelines, the final Recall@1000, nDCG@10 and MAP values obtained under both linear and trie execution; all values match to machine precision. This verification was performed during development but omitted from the original submission. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical timing result is measured, not derived by construction

full rationale

The paper describes a trie data structure for sharing prefixes in IR pipeline experiment plans and reports a directly measured 26% reduction in wall-clock duration on one concrete setup (BM25 + MonoT5 + DuoT5 on MSMARCO v2). No equations, fitted parameters, or self-citations are invoked to obtain the efficiency number; the claim rests on external timing instrumentation. The equivalence of recall/precision between trie and linear execution is an implementation correctness requirement, not a definitional or self-referential step. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pipeline stages produce deterministic, reusable intermediate results that can be safely shared via a trie without side effects on downstream metrics.

axioms (1)
  • domain assumption Tries can represent experiment plans such that common prefixes correspond to shared computation that can be executed once.
    Invoked when the abstract states that the trie enhances efficiency compared to a sequential linear plan.

pith-pipeline@v0.9.1-grok · 5716 in / 1241 out tokens · 33604 ms · 2026-07-02T06:25:38.937686+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 24 canonical work pages

  1. [1]

    Tonellotto, C

    N. Tonellotto, C. Macdonald, I. Ounis, Efficient and effective retrieval using selective pruning, in: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, 2013, pp. 63–72. doi:10.1145/2433396.2433407

  2. [2]

    Liu, Learning to rank for information retrieval, Found

    T. Liu, Learning to rank for information retrieval, Found. Trends Inf. Retr. 3 (2009) 225–331. doi:10.1561/1500000016

  3. [3]

    L. Wang, J. Lin, D. Metzler, A cascade ranking model for efficient ranked retrieval, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 105–114. doi:10.1145/2009916.2009934

  4. [4]

    MacAvaney, N

    S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1491–1500. doi:10. 1145/3511808.3557231

  5. [5]

    Pradeep, R

    R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, CoRR abs/2101.05667 (2021). URL: https://arxiv.org/abs/ 2101.05667.arXiv:2101.05667

  6. [6]

    Pradeep, S

    R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document rerank- ing with open-source large language models, 2023. URL: https://arxiv.org/abs/2309.15088. arXiv:2309.15088

  7. [7]

    Macdonald, Z

    C. Macdonald, Z. Shen, N. Tonellotto, A replicability study of joint product quantisation for effective space-efficient dense retrieval, in: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2026. doi:10.1145/3805712.3808565, in press

  8. [8]

    Macdonald, N

    C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 4526–4533. doi:10.1145/3459637.3482013

  9. [9]

    Macdonald, J

    C. Macdonald, J. Fang, A. Parry, Z. Meng, Constructing and Evaluating Declarative RAG Pipelines in PyTerrier, in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2025, pp. 4035–4040. doi:10.1145/3726302. 3730150

  10. [10]

    E. G. Lionis, C. Macdonald, S. MacAvaney, Pipeline inspection, visualization, and interoperability in PyTerrier, in: Advances in Information Retrieval - 48th European Conference on Information Retrieval, ECIR 2026, Delft, The Netherlands, March 29 - April 2, 2026, Proceedings, Part IV, Lecture Notes in Computer Science, Springer, 2026, pp. 159–165. doi:10...

  11. [11]

    MacAvaney, A

    S. MacAvaney, A. Yates, S. Feldman, D. Downey, A. Cohan, N. Goharian, Simplified data wrangling with ir_datasets, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 2429–2436. doi:10.1145/3404835.3463254

  12. [12]

    Penha, A

    G. Penha, A. Câmara, C. Hauff, Evaluating the robustness of retrieval pipelines with query variation generators, in: Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part I, Lecture Notes in Computer Science, Springer, 2022, pp. 397–412. doi:10.1007/978-3-030-99736-6\_27

  13. [13]

    MacAvaney, C

    S. MacAvaney, C. Macdonald, On precomputation and caching in information retrieval experiments with pipeline architectures, in: Proceedings of the 2nd International Workshop on Open Web Search co-located with the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 10, 2025, CEUR Workshop Proceedings, CEUR-WS.org, 2025, pp. 2...

  14. [14]

    Tonellotto, C

    N. Tonellotto, C. Macdonald, I. Ounis, Efficient query processing for scalable web search, Founda- tions and Trends in Information Retrieval 12 (2018) 319–500. doi:10.1561/1500000057

  15. [15]

    T. K. Sellis, Multiple-query optimization, ACM Transactions on Database Systems (TODS) 13 (1988) 23–52

  16. [16]

    Straube, M

    D. Straube, M. Ozsu, Query optimization and execution plan generation in object-oriented data management systems, IEEE Transactions on Knowledge and Data Engineering 7 (1995) 210–227. doi:10.1109/69.382293

  17. [17]

    Neumann, Efficient generation and execution of DAG-structured query graphs, Ph.D

    T. Neumann, Efficient generation and execution of DAG-structured query graphs, Ph.D. thesis, University of Mannheim, Germany, 2005. URL: https://api.semanticscholar.org/CorpusID:487710

  18. [18]

    Armbrust, R

    M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, M. Zaharia, Spark SQL: Relational data processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, Association for Computing Machinery, New York, NY, USA, 2015, p. 1383–1394. doi: 10.11...

  19. [19]

    Aseman-Manzar, S

    M.-M. Aseman-Manzar, S. Karimian-Aliabadi, R. Entezari-Maleki, B. Egger, A. Movaghar, Cost- aware resource recommendation for DAG-based big data workflows: An Apache Spark case study, IEEE Transactions on Services Computing 16 (2023) 1726–1737. doi: 10.1109/TSC.2022. 3203010

  20. [20]

    C. Macdonald, Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 1309–1312. doi:10.1145/3209978.3210174

  21. [21]

    Harizopoulos, V

    S. Harizopoulos, V. Shkapenyuk, A. Ailamaki, Qpipe: a simultaneously pipelined relational query engine, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, Association for Computing Machinery, New York, NY, USA, 2005, p. 383–394. doi:10.1145/1066157.1066201

  22. [22]

    Fröbe, J

    M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen, M. Potthast, The information retrieval experiment platform, in: Proceedings of the 46th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 2826–2836. doi:10.1145...

  23. [23]

    N. J. Larsson, Structures of String Matching and Data Compression., Lund University, Sweden, 1999

  24. [24]

    Karimov, Trie Data Structure, Apress, Berkeley, CA, 2020, pp

    E. Karimov, Trie Data Structure, Apress, Berkeley, CA, 2020, pp. 67–75. doi: 10.1007/ 978-1-4842-5769-2_9

  25. [25]

    Gunasinghe, D

    U. Gunasinghe, D. Alahakoon, The adaptive suffix tree: A space efficient sequence learning algorithm, in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1–8. doi:10.1109/IJCNN.2013.6707052

  26. [26]

    D. R. Morrison, PATRICIA - practical algorithm to retrieve information coded in alphanumeric, J. ACM 15 (1968) 514–534. doi:10.1145/321479.321481

  27. [27]

    Nguyen, M

    T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A Human Generated MAchine Reading COmprehen- sion Dataset (2016). URL: https://www.microsoft.com/en-us/research/publication/ ms-marco-human-generated-machine-reading-comprehension-dataset/

  28. [28]

    Craswell, B

    N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 deep learning track, CoRR abs/2003.07820 (2020). URL: https://arxiv.org/abs/2003.07820

  29. [29]

    Craswell, B

    N. Craswell, B. Mitra, E. Yilmaz, D. Campos, J. Lin, Overview of the TREC 2021 deep learning track, 2025. URL: https://arxiv.org/abs/2507.08191

  30. [30]

    Macdonald, N

    C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, ICTIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 161–168. doi:10.1145/3409256.3409829