Decision-Theoretic Stopping Rules for Document Screening

Aaron H.A. Fletcher; Mark Stevenson

arxiv: 2606.07071 · v1 · pith:QIBMHFSWnew · submitted 2026-06-05 · 💻 cs.IR

Pith reviewed 2026-06-27 20:50 UTC · model grok-4.3

classification 💻 cs.IR

keywords stopping rules · decision theory · technology-assisted review · document screening · expected value of perfect information · patent search · systematic reviews · information retrieval

The pith

Decision theory yields EVPI-based stopping policies for document screening that achieve higher net utility than recall-targeted rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies decision theory to the common problem of deciding when to stop reviewing search results instead of using fixed recall targets. It derives three practical stopping policies from the expected value of perfect information and tests them on patent examination and medical systematic review tasks. Experiments on CLEF-IP and medical datasets show these policies produce higher net utility under given cost and payoff settings. A sympathetic reader would care because existing methods ignore the specific reasons for screening and can lead to wasteful or incomplete review.

Core claim

Framing stopping as a decision problem under uncertainty allows derivation of EVPI policies that stop screening when the expected value of resolving uncertainty about remaining documents falls below the cost of further review; on CLEF-IP and systematic review datasets these policies yield higher net utility than existing TAR stopping rules across the evaluated cost-payoff settings.
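
The stopping condition in this claim can be written compactly. The block below is a sketch in standard decision-theoretic notation, not the paper's own formulation: the action set $A$, the unknown state $\theta$ (for example, the relevance of the unreviewed documents), the utility $U(a,\theta)$, and the review cost $c_{\text{review}}$ are symbols assumed here for illustration.

```latex
\mathrm{EVPI}
  \;=\; \mathbb{E}_{\theta}\!\left[\max_{a \in A} U(a,\theta)\right]
  \;-\; \max_{a \in A}\, \mathbb{E}_{\theta}\!\left[U(a,\theta)\right],
\qquad
\text{stop when } \mathrm{EVPI} < c_{\text{review}}.
```

The first term is the expected utility when the state is revealed before acting; the second is the best achievable under current beliefs. EVPI is never negative and bounds the value of any further information from above, so the rule stops once even perfect knowledge could not repay the cost of obtaining it.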

What carries the argument

Expected Value of Perfect Information (EVPI) policies that quantify the benefit of knowing the true relevance status of unreviewed documents to decide whether continued screening is worthwhile.
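
A minimal sketch of how such a quantity might be computed for a small discrete problem. It is not one of the paper's three policies: the two states, the `grant`/`reject` actions, the utility numbers, and the function names `evpi` and `should_stop` are all hypothetical and chosen only to make the calculation concrete.

```python
from typing import Sequence

# Hypothetical toy problem (all numbers are assumptions, not from the paper).
# States: index 0 = no relevant document remains unreviewed,
#         index 1 = at least one relevant document remains.
PRIOR = [0.95, 0.05]

# UTILITY[action][state]; action 0 = "grant", action 1 = "reject".
UTILITY = [
    [10.0, -100.0],  # granting is good unless relevant prior art was missed
    [-20.0, 10.0],   # rejecting is only right if relevant prior art exists
]


def expected_utilities(prior: Sequence[float],
                       utility: Sequence[Sequence[float]]) -> list:
    """Expected utility of each action under the current beliefs."""
    return [sum(p * u for p, u in zip(prior, row)) for row in utility]


def evpi(prior: Sequence[float],
         utility: Sequence[Sequence[float]]) -> float:
    """Expected value of perfect information for a discrete decision."""
    # Expected utility if the state were revealed before choosing an action.
    with_information = sum(
        p * max(row[state] for row in utility)
        for state, p in enumerate(prior)
    )
    # Best expected utility when acting on current beliefs alone.
    without_information = max(expected_utilities(prior, utility))
    return with_information - without_information


def should_stop(prior: Sequence[float],
                utility: Sequence[Sequence[float]],
                review_cost: float) -> bool:
    """Stop when resolving the uncertainty is worth less than it costs."""
    return evpi(prior, utility) < review_cost


print(evpi(PRIOR, UTILITY))
print(should_stop(PRIOR, UTILITY, review_cost=8.0))
```

With these toy numbers the EVPI is 5.5 utility units, so further screening is worthwhile only if it costs less than that. A full policy would recompute the prior as documents are reviewed; that update step is omitted here.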

If this is right

Stopping decisions become specific to the economic context of the search task rather than a single recall target.
In patent work the policies can reduce review volume while preserving the net value of found documents.
In systematic reviews the policies balance the cost of missing studies against review effort more directly than recall thresholds.
The same decision-theoretic framing can be reused for other professional search tasks that involve stopping under cost and payoff uncertainty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If costs and payoffs must be estimated rather than known exactly, the policies could be combined with sensitivity analysis to identify robust stopping points.
The approach might extend to dynamic settings where payoffs change as review progresses and new information arrives.
Larger-scale tests on streaming or multi-user search logs would show whether the utility advantage persists outside the two evaluated domains.

Load-bearing premise

The costs of reviewing a document and the payoffs for finding or missing a relevant one are known accurately enough in advance to compute the policies without substantial error.

What would settle it

Run the EVPI policies on the same datasets but with deliberately inaccurate cost or payoff values and check whether net utility falls below that of the recall-based baselines.
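
The check described above might be prototyped along the following lines. This is a sketch under stated assumptions rather than the paper's experimental code: `threshold_stop` is a simple expected-gain stand-in for the EVPI policies, `recall_target_stop` is an oracle baseline that sees the true labels, and the scores, labels, and cost/payoff values are invented toy data.

```python
from typing import Dict, Sequence, Tuple

# Toy ranked list (assumed data): calibrated relevance scores and true labels.
PROBS = [0.95, 0.85, 0.6, 0.4, 0.25, 0.12, 0.05, 0.02]
LABELS = [1, 1, 1, 0, 1, 0, 0, 0]


def net_utility(labels: Sequence[int], n_reviewed: int,
                payoff: float, cost: float, miss_penalty: float) -> float:
    """Payoff for found documents minus review cost and penalty for misses."""
    found = sum(labels[:n_reviewed])
    missed = sum(labels[n_reviewed:])
    return payoff * found - cost * n_reviewed - miss_penalty * missed


def threshold_stop(probs: Sequence[float], payoff: float,
                   cost: float, miss_penalty: float) -> int:
    """Stand-in policy: stop at the first document whose expected gain
    (finding it earns the payoff and avoids the miss penalty) is below cost."""
    for rank, p in enumerate(probs):
        if p * (payoff + miss_penalty) < cost:
            return rank
    return len(probs)


def recall_target_stop(labels: Sequence[int], target: float) -> int:
    """Oracle baseline: shortest prefix that reaches the recall target."""
    total = sum(labels)
    if total == 0:
        return 0
    found = 0
    for rank, label in enumerate(labels, start=1):
        found += label
        if found / total >= target:
            return rank
    return len(labels)


def misspecification_check(probs: Sequence[float], labels: Sequence[int],
                           payoff: float, cost: float, miss_penalty: float,
                           factors: Sequence[float],
                           recall_target: float) -> Tuple[float, Dict[float, float]]:
    """The policy decides with cost * factor; every outcome is then scored
    with the true parameters."""
    baseline_n = recall_target_stop(labels, recall_target)
    baseline = net_utility(labels, baseline_n, payoff, cost, miss_penalty)
    results = {}
    for factor in factors:
        n = threshold_stop(probs, payoff, cost * factor, miss_penalty)
        results[factor] = net_utility(labels, n, payoff, cost, miss_penalty)
    return baseline, results


baseline, results = misspecification_check(
    PROBS, LABELS, payoff=10.0, cost=3.0, miss_penalty=5.0,
    factors=[0.5, 1.0, 1.5], recall_target=0.75,
)
print("recall baseline:", baseline)
for factor, utility in results.items():
    print(f"cost x{factor}: net utility {utility}")
```

In this toy run the policy beats the baseline when the cost is correct or underestimated, but falls below it when the cost is overestimated by a factor of 1.5. That is the kind of crossover the proposed experiment would look for on the real collections.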

Figures

Figures reproduced from arXiv: 2606.07071 by Aaron H.A. Fletcher, Mark Stevenson.

**Figure 2.** Error type decomposition for systematic review screening.

Original abstract

Deciding when to stop reviewing the results of a search is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target and do not take into account the reason for examining the results, potentially leading to sub-optimal recommendations. This paper applies decision theory to the problem and uses it to derive three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed approach generally produces more appropriate stopping decisions than existing methods, as demonstrated by higher net utility under the evaluated cost and payoff settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives three EVPI-based stopping policies for document screening and reports higher net utility than recall targets on CLEF-IP and medical datasets, but the gains rest on fixed known costs and payoffs with no sensitivity checks shown.

The paper derives three practical stopping policies from expected value of perfect information and tests them on CLEF-IP patent data and medical systematic review collections. It claims these policies produce more appropriate stopping points than recall-target baselines when measured by net utility under the chosen cost and payoff settings.

The new element is the shift from pre-set recall goals to policies that explicitly weigh the value of additional information against review cost. The authors walk through the decision-theoretic setup and turn it into usable rules for two professional tasks. The experiments are run on standard collections and the comparison is straightforward, which gives the work a concrete empirical anchor.

The soft spot is the treatment of costs and payoffs as known quantities. The reported utility advantage is calculated under the same fixed parameters used to set the stopping thresholds, and the manuscript does not show how the advantage holds up if those values are perturbed. In practice those numbers are estimates, so any material uncertainty would affect both the policy and the evaluation. That is the main robustness question.

The work is aimed at people working on technology-assisted review for patent examination or systematic reviews. A reader focused on stopping rules or decision theory in IR would get direct value from the derivation and the head-to-head results.

It is worth sending to peer review. The framing is distinct enough and the experiments are on real tasks, even though the sensitivity issue needs attention from referees.

Referee Report

2 major / 1 minor

Summary. The paper proposes three stopping policies for document screening in professional search tasks (patent examination and systematic reviews) derived from the Expected Value of Perfect Information (EVPI) within a decision-theoretic framework. These policies are contrasted with existing recall-target methods from Technology-Assisted Review (TAR). Experiments on the CLEF-IP and medical systematic review datasets are reported to show generally higher net utility for the EVPI policies under the evaluated cost and payoff settings.

Significance. If the central claim holds after addressing parameter sensitivity, the work would provide a principled alternative to heuristic recall targets by explicitly incorporating task-specific costs and payoffs into stopping decisions. The use of two distinct real-world collections (CLEF-IP and systematic-review data) supplies a concrete empirical test of the approach.

major comments (2)

[Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.
[Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.

minor comments (1)

[Method section] Notation for the three EVPI policies could be introduced more explicitly with consistent symbols across the derivation and experimental tables.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding parameter sensitivity. These observations correctly identify a gap in the current manuscript. We address each point below and commit to revisions that will strengthen the empirical support for the claims.

Referee: [Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.

Authors: We agree that the absence of sensitivity analysis limits the strength of the central claim. In the revised manuscript we will add a dedicated subsection to the Experiments section that systematically perturbs the cost and payoff parameters (e.g., multiplicative factors of 0.5, 0.8, 1.2, and 1.5) and reports the resulting net-utility differences and stopping decisions for the three EVPI policies versus the recall-target baselines on both CLEF-IP and the systematic-review collections. This will quantify the stability of the reported advantages. revision: yes

Referee: [Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.

Authors: The referee is correct that the derivations treat the cost/payoff vector as fixed. We will expand the Method section with a short discussion of this modeling assumption and its implications. The primary mitigation will be the empirical sensitivity analysis described above, which directly examines how perturbations affect both thresholds and net utilities. If space allows, we will also note that the EVPI formulas are differentiable in the parameters, permitting straightforward first-order propagation analysis, though the empirical results will constitute the main addition. revision: yes
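
The first-order propagation mentioned in this response could take the following form. This is a generic sketch, not a derivation from the manuscript: $c$ (review cost), $b$ (payoff), and the perturbations $\Delta c$, $\Delta b$ are symbols assumed here.

```latex
\Delta\,\mathrm{EVPI}
  \;\approx\;
  \frac{\partial\,\mathrm{EVPI}}{\partial c}\,\Delta c
  \;+\;
  \frac{\partial\,\mathrm{EVPI}}{\partial b}\,\Delta b
```

Because EVPI is built from maxima over actions, it is piecewise smooth in the parameters. The approximation applies within a region where the preferred action does not change, and the partial derivatives can jump at parameter values where two actions tie, which is where an empirical perturbation study is most informative.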

Circularity Check

0 steps flagged

No circularity; derivation from external decision theory and evaluation on independent datasets

The paper derives EVPI-based stopping policies from standard decision theory assuming known costs/payoffs as inputs, then evaluates the resulting policies on external collections (CLEF-IP, medical systematic reviews) against recall-target baselines using net utility computed under those same parameters. This is consistent application of the method rather than reduction by construction: the baselines are not derived from the same parameters, the datasets are independent, and no equations or self-citations reduce the central claim to a tautology or fitted input. No self-definitional steps, fitted predictions, or load-bearing self-citations are present in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that decision-theoretic EVPI can be computed and applied to stopping decisions, plus the practical availability of cost and payoff values for the evaluated tasks.

free parameters (1)

cost and payoff settings
Net utility is computed under specific evaluated cost and payoff values that are not derived from first principles.

axioms (1)

domain assumption: Expected Value of Perfect Information provides a sound basis for deriving stopping policies in document screening
Invoked to justify the three practical policies.


[1] [1]

A. E. Ades, G. Lu, and K. Claxton. 2004. Expected Value of Sample Information Calculations in Medical Decision Modeling.Medical Decision Making24, 2 (March 2004), 207–227. doi:10.1177/0272989X04263162

work page doi:10.1177/0272989x04263162 2004

[2] [2]

Leif Azzopardi. 2011. The economics in interactive information retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval(Beijing, China)(SIGIR ’11). Association for Computing Machinery, New York, NY, USA, 15–24. doi:10.1145/2009916.2009923

work page doi:10.1145/2009916.2009923 2011

[3] [3]

Leif Azzopardi, Diane Kelly, and Kathy Brennan. 2013. How query cost affects search behavior. InProceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval(Dublin, Ireland)(SIGIR ’13). Association for Computing Machinery, New York, NY, USA, 23–32. doi:10. 1145/2484028.2484049

arXiv 2013

[4] [4]

R. E. Barlow and H. D. Brunk. 1972. The Isotonic Regression Problem and its Dual.J. Amer. Statist. Assoc.67, 337 (March 1972), 140–147. doi:10.1080/01621459. 1972.10481216

work page doi:10.1080/01621459 1972

[5] [5]

Reem Bin-Hezam and Mark Stevenson. 2024. RLStop: A Reinforcement Learning Stopping Method for TAR. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2604–2608. doi:10.1145/3626772.3657911

work page doi:10.1145/3626772.3657911 2024

[6] [6]

2006.Decision Modelling For Health Economic Evaluation

Andrew Briggs, Karl Claxton, and Mark Sculpher. 2006.Decision Modelling For Health Economic Evaluation. Oxford University PressOxford. doi:10.1093/oso/ 9780198526629.001.0001

work page doi:10.1093/oso/ 2006

[7] [7]

Max W Callaghan and Finn Müller-Hansen. 2020. Statistical stopping criteria for automated screening in systematic reviews.Systematic Reviews9, 1 (Dec. 2020),

2020

[8] [8]

doi:10.1186/s13643-020-01521-4

work page doi:10.1186/s13643-020-01521-4

[9] [9]

Yuan Shih Chow, Herbert Robbins, David Siegmund, and Yuan Shih Chow. 1971. Great expectations: The theory of optimal stopping. Houghton Mifflin, Boston

1971

[10] [10]

Karl Claxton. 1999. The irrelevance of inference: a decision-making approach to the stochastic evaluation of health care technologies.Journal of Health Economics 18, 3 (June 1999), 341–364. doi:10.1016/S0167-6296(98)00039-3

work page doi:10.1016/s0167-6296(98)00039-3 1999

[11] [11]

William S. Cooper. 1973. On selecting a measure of retrieval effectiveness part II. Implementation of the philosophy.Journal of the American Society for Information Science24, 6 (Nov. 1973), 413–424. doi:10.1002/asi.4630240603

work page doi:10.1002/asi.4630240603 1973

[12] [12]

Cormack and Maura R

Gordon V. Cormack and Maura R. Grossman. 2014. Evaluation of machine- learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on Research & develop- ment in information retrieval. ACM, Gold Coast Queensland Australia, 153–162. doi:10.1145/2600428.2609601

work page doi:10.1145/2600428.2609601 2014

[13] [13]

Cormack and Maura R

Gordon V. Cormack and Maura R. Grossman. 2016. Engineering Quality and Reliability in Technology-Assisted Review. InProceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy)(SIGIR ’16). Association for Computing Machinery, New York, NY, USA, 75–84. doi:10.1145/2911451.2911510

work page doi:10.1145/2911451.2911510 2016

[14] [14]

Cormack and Maura R

Gordon V. Cormack and Maura R. Grossman. 2018. The Quest for Total Recall. In Proceedings of the ACM Symposium on Document Engineering 2018. ACM, Halifax NS Canada, 1–2. doi:10.1145/3209280.3232788

work page doi:10.1145/3209280.3232788 2018

[15] [15]

Giorgio Maria Di Nunzio. 2018. A Study of an Automatic Stopping Strategy for Technologically Assisted Medical Reviews. InAdvances in Information Retrieval, Gabriella Pasi, Benjamin Piwowarski, Leif Azzopardi, and Allan Hanbury (Eds.). Springer International Publishing, Cham, 672–677

2018

[16] [16]

John M. Dwyer. 2007. Howard Raiffa and Robert Schlaifer. Applied statistical decision theory. Boston: Clinton Press, Inc., 1961. 356 pages.Behavioral Science 7, 1 (Jan. 2007), 103–104. doi:10.1002/bs.3830070108

work page doi:10.1002/bs.3830070108 2007

[17] [17]

Ferguson

Thomas S. Ferguson. 1989. Who Solved the Secretary Problem?Statist. Sci.4, 3 (Aug. 1989), 294–296. doi:10.1214/ss/1177012493

work page doi:10.1214/ss/1177012493 1989

[18] Aaron Fletcher and Mark Stevenson. 2026. Confidence-Based Stopping Methods for Systematic Reviews. In Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’26) (Melbourne, VIC, Australia). Association for Computing Machinery, New York, NY, USA.

[19] J. C. Gittins. 1979. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society Series B: Statistical Methodology 41, 2 (Jan. 1979), 148–164. doi:10.1111/j.2517-6161.1979.tb01068.x

[20] C. A. E. Goodhart. 1984. Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice. Macmillan Education UK, London, 91–121. doi:10.1007/978-1-349-17295-5_4

[21] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. 2016. TREC 2016 Total Recall Track Overview. In Text Retrieval Conference. https://api.semanticscholar.org/CorpusID:5826060

[22] Anna Heath and Gianluca Baio. 2018. Calculating the Expected Value of Sample Information Using Efficient Nested Monte Carlo: A Tutorial. Value in Health 21, 11 (2018), 1299–1304. doi:10.1016/j.jval.2018.05.004

[23] JPT Higgins, J Chandler, M Cumpston, T Li, MJ Page, and VA Welch. 2024. Cochrane Handbook for Systematic Reviews of Interventions. Vol. 6.5. Cochrane. www.cochrane.org/handbook

[24] Ronald Howard. 1966. Information Value Theory. IEEE Transactions on Systems Science and Cybernetics 2, 1 (1966), 22–26. doi:10.1109/TSSC.1966.300074

[25] Institute of Medicine (U.S.) and Jill Eden (Eds.). 2011. Finding what works in health care: standards for systematic reviews. National Academies Press, Washington, D.C.

[26] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and Rene Spijker. 2017. CLEF 2017 technologically assisted reviews in empirical medicine overview. In 18th Working Notes of CLEF Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings 1866, 1–29. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85034732447&partnerID=40&md5=a183b346edceb1918338a...

[27] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and Rene Spijker. 2018. CLEF 2018 technologically assisted reviews in empirical medicine overview. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings 2125. https://strathprints.strath.ac.uk/66446/

[28] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and Rene Spijker. 2019. CLEF 2019 technology assisted reviews in empirical medicine overview. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings 2380. https://strathprints.strath.ac.uk/71253/

[29] Donald H Kraft and T Lee. 1979. Stopping rules and their effect on expected search length. Information Processing and Management 15, 1 (1979), 47–58. doi:10.1016/0306-4573(79)90007-4

[30] David D. Lewis, Eugene Yang, and Ophir Frieder. 2021. Certifying One-Phase Technology-Assisted Reviews. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 893–902. doi:10.1145/3459637.3482415

[31] Dan Li and Evangelos Kanoulas. 2020. When to Stop Reviewing in Technology-Assisted Reviews: Sampling from an Adaptive Distribution to Estimate Residual Relevant Documents. ACM Trans. Inf. Syst. 38, 4, Article 41 (Sept. 2020), 36 pages. doi:10.1145/3411755

[32] Parvaz Mahdabi, Mostafa Keikha, Shima Gerani, Monica Landoni, and Fabio Crestani. 2011. Building Queries for Prior-Art Search. In Multidisciplinary Information Retrieval, Allan Hanbury, Andreas Rauber, and Arjen P. de Vries (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 3–15.

[33] David Maxwell, Leif Azzopardi, Kalervo Järvelin, and Heikki Keskustalo. 2015. Searching and Stopping: An Analysis of Stopping Rules and Strategies. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM ’15). Association for Computing Machinery, New York, NY, USA, 313–322. doi:10.1145/2806416.2806476

[34] J. J. McCall. 1970. Economics of Information and Job Search. The Quarterly Journal of Economics 84, 1 (Feb. 1970), 113. doi:10.2307/1879403

[35] Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems 27, 1 (Dec. 2008), 1–27. doi:10.1145/1416950.1416952

[36] Alessio Molinari and Andrea Esuli. 2024. SALT: efficiently stopping TAR by improving priors estimates. Data Mining and Knowledge Discovery 38, 2 (March 2024), 535–568. doi:10.1007/s10618-023-00961-5

[37] Christopher Norman, Mariska Leeflang, and Aurélie Névéol. 2018. Data Extraction and Synthesis in Systematic Reviews of Diagnostic Test Accuracy: A Corpus for Automating and Evaluating the Process. AMIA ... Annual Symposium proceedings. AMIA Symposium 2018 (2018), 817–826.

[38] Christopher R. Norman, Mariska M. G. Leeflang, Raphaël Porcher, and Aurélie Névéol. 2019. Measuring the impact of screening automation on meta-analyses of diagnostic test accuracy. Systematic Reviews 8, 1 (2019). doi:10.1186/s13643-019-1162-x Publisher: Springer Science and Business Media LLC

[39] Florina Piroi, Giovanna Roda, Veronika Zenz, and John Tait. 2021. The CLEF-IP 2009 Test Collection. doi:10.48436/9SXBQ-JS515

[40] Johannes B. Reitsma, Afina S. Glas, Anne W.S. Rutjes, Rob J.P.M. Scholten, Patrick M. Bossuyt, and Aeilko H. Zwinderman. 2005. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. Journal of Clinical Epidemiology 58, 10 (2005), 982–990. doi:10.1016/j.jclinepi.2005.02.022

[41] S.E. Robertson. 1977. The Probability Ranking Principle in IR. Journal of Documentation 33, 4 (April 1977), 294–304. doi:10.1108/eb026647

[42] Giovanna Roda, John Tait, Florina Piroi, and Veronika Zenz. 2010. CLEF-IP 2009: Retrieval Experiments in the Intellectual Property Domain. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, Carol Peters, Giorgio Maria Di Nunzio, Mikko Kurimo, Thomas Mandl, Djamel Mostefa, Anselmo Peñas, and Giovanna Roda (Eds.). Springer Berlin ...

[43] Adam Roegiest, Gordon V. Cormack, Charles L. A. Clarke, and Maura R. Grossman. 2015. TREC 2015 Total Recall Track Overview. In Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-319. National Institute of Standa...

[44] Ville Satopaa, Jeannie Albrecht, David Irwin, and Barath Raghavan. 2011. Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops. IEEE, Minneapolis, MN, USA, 166–171. doi:10.1109/ICDCSW.2011.20

[45] Leonard J. Savage. 1954. The foundations of statistics. By Leonard J. Savage, John Wiley & Sons, Inc., 1954, 294 pp. Naval Research Logistics Quarterly 1, 3 (1954), 236–236. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/nav.3800010316 doi:10.1002/nav.3800010316

[46] Mark Stevenson and Reem Bin-Hezam. 2023. Stopping Methods for Technology-assisted Reviews Based on Point Processes. ACM Trans. Inf. Syst. 42, 3 (Dec. 2023), 73. doi:10.1145/3631990

[47] George J. Stigler. 1961. The Economics of Information. Journal of Political Economy 69, 3 (June 1961), 213–225. doi:10.1086/258464

[48] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and Language Models Examined. In Proceedings of the 19th Australasian Document Computing Symposium (Melbourne, VIC, Australia) (ADCS ’14). Association for Computing Machinery, New York, NY, USA, 58–65. doi:10.1145/2682862.2682863

[49] John Von Neumann and Oskar Morgenstern. 2007. Theory of games and economic behavior (60. anniversary ed., 4. print., and 1. paperb. print ed.). Princeton University Press, Princeton, NJ.

[50] A. Wald. 1945. Sequential Tests of Statistical Hypotheses. The Annals of Mathematical Statistics 16, 2 (June 1945), 117–186. doi:10.1214/aoms/1177731118

[51] Milton Weinstein and Richard Zeckhauser. 1973. Critical ratios and efficient allocation. Journal of Public Economics 2, 2 (April 1973), 147–157. doi:10.1016/0047-2727(73)90002-9

[52] Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Edmonton Alberta Canada, 694–699. doi:10.1145/775047.775151