Decision-Theoretic Stopping Rules for Document Screening
Pith reviewed 2026-06-27 20:50 UTC · model grok-4.3
The pith
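Decision theory yields stopping policies for document screening, based on the Expected Value of Perfect Information (EVPI), that generally achieve higher net utility than recall-targeted rules under the evaluated cost and payoff settings.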
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Framing stopping as a decision problem under uncertainty allows the derivation of EVPI policies, which stop screening when the expected value of resolving uncertainty about the remaining documents falls below the cost of further review. On CLEF-IP and systematic review datasets, these policies generally yield higher net utility than existing Technology-Assisted Review (TAR) stopping rules across the evaluated cost-payoff settings.
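To make the stopping condition concrete, here is a minimal Python sketch of one way such a rule could be evaluated for a single unreviewed document. It is not one of the paper's three policies, whose exact forms are not given on this page. The two-action model (exclude or include without review), the function names `evpi` and `should_stop`, and all parameter names and values are assumptions made for illustration.

```python
from typing import Sequence


def evpi(probs: Sequence[float], utilities: Sequence[Sequence[float]]) -> float:
    """Expected Value of Perfect Information for a discrete decision.

    probs[s] is the probability of state s.
    utilities[a][s] is the utility of taking action a when the state is s.
    """
    # Expected utility if the state were revealed before acting.
    with_info = sum(p * max(u[s] for u in utilities) for s, p in enumerate(probs))
    # Expected utility of the single best action chosen without that knowledge.
    without_info = max(
        sum(p * u[s] for s, p in enumerate(probs)) for u in utilities
    )
    return with_info - without_info


def should_stop(
    p_relevant: float,
    payoff_found: float,
    loss_missed: float,
    loss_false_include: float,
    review_cost: float,
) -> bool:
    """Stop when resolving the document's relevance is worth no more than a review."""
    probs = [1.0 - p_relevant, p_relevant]  # states: 0 = not relevant, 1 = relevant
    utilities = [
        [0.0, -loss_missed],                  # hypothetical action: exclude unreviewed
        [-loss_false_include, payoff_found],  # hypothetical action: include unreviewed
    ]
    return evpi(probs, utilities) <= review_cost
```

With these illustrative numbers, a document with relevance probability 0.5 has an EVPI well above a review cost of 1, so screening continues, while a document with probability 0.01 does not justify the review. The conclusion depends entirely on the assumed payoffs and costs.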
What carries the argument
Expected Value of Perfect Information (EVPI) policies that quantify the benefit of knowing the true relevance status of unreviewed documents to decide whether continued screening is worthwhile.
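For reference, the textbook definition of EVPI and the stopping condition described above can be written as follows. The symbols are notation assumed here rather than taken from the paper: $a \in A$ is an action, $\theta$ the unknown relevance status of the unreviewed documents, $U$ the utility function, and $c$ the cost of further review.

```latex
\mathrm{EVPI}
  \;=\; \mathbb{E}_{\theta}\!\left[\max_{a \in A} U(a,\theta)\right]
  \;-\; \max_{a \in A}\, \mathbb{E}_{\theta}\!\left[U(a,\theta)\right],
\qquad
\text{stop when } \mathrm{EVPI} < c .
```

The first term is the expected utility of acting after the uncertainty is resolved, and the second is the expected utility of the best action taken now. The paper's three policies may instantiate or approximate this quantity differently.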
If this is right
- Stopping decisions become specific to the economic context of the search task rather than a single recall target.
- In patent work the policies can reduce review volume while preserving the net value of found documents.
- In systematic reviews the policies balance the cost of missing studies against review effort more directly than recall thresholds.
- The same decision-theoretic framing can be reused for other professional search tasks that involve stopping under cost and payoff uncertainty.
Where Pith is reading between the lines
- If costs and payoffs must be estimated rather than known exactly, the policies could be combined with sensitivity analysis to identify robust stopping points.
- The approach might extend to dynamic settings where payoffs change as review progresses and new information arrives.
- Larger-scale tests on streaming or multi-user search logs would show whether the utility advantage persists outside the two evaluated domains.
Load-bearing premise
The costs of reviewing a document and the payoffs for finding or missing a relevant one are known accurately enough in advance to compute the policies without substantial error.
What would settle it
Run the EVPI policies on the same datasets but with deliberately inaccurate cost or payoff values and check whether net utility falls below that of the recall-based baselines.
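One possible shape for such a test is sketched below in Python. It is a toy under stated assumptions, not the paper's experimental setup: the relevance probabilities are synthetic, the threshold rule stands in for an EVPI policy, and every name and value (`TRUE`, `PROBS`, `threshold_stop`, `recall_stop`, the 0.95 recall target) is hypothetical. The perturbation factors follow those proposed in the simulated rebuttal, with 1.0 added as the unperturbed reference.

```python
def expected_net_utility(probs, stop_rank, payoff, miss_loss, review_cost):
    """Expected net utility of reviewing the first stop_rank ranked documents."""
    found = sum(probs[:stop_rank])    # expected relevant documents found
    missed = sum(probs[stop_rank:])   # expected relevant documents left behind
    return payoff * found - miss_loss * missed - review_cost * stop_rank


def threshold_stop(probs, payoff, miss_loss, review_cost):
    """Stop at the first rank whose expected gain from review is at most its cost."""
    for rank, p in enumerate(probs):
        if p * (payoff + miss_loss) <= review_cost:
            return rank
    return len(probs)


def recall_stop(probs, target):
    """Baseline: stop once expected recall reaches the target."""
    total, running = sum(probs), 0.0
    for rank, p in enumerate(probs, start=1):
        running += p
        if running >= target * total:
            return rank
    return len(probs)


# Hypothetical "true" economics and a synthetic, decreasing relevance ranking.
TRUE = dict(payoff=10.0, miss_loss=5.0, review_cost=1.0)
PROBS = [0.9 * 0.97 ** i for i in range(300)]
FACTORS = [0.5, 0.8, 1.0, 1.2, 1.5]


def sweep():
    """Choose the stop with a mis-specified review cost, score it with the true one."""
    results = {}
    for factor in FACTORS:
        assumed = dict(TRUE, review_cost=TRUE["review_cost"] * factor)
        stop = threshold_stop(PROBS, **assumed)
        results[factor] = expected_net_utility(PROBS, stop, **TRUE)
    return results


baseline = expected_net_utility(PROBS, recall_stop(PROBS, 0.95), **TRUE)

if __name__ == "__main__":
    for factor, utility in sweep().items():
        print(f"cost factor {factor}: net utility {utility:.2f} (baseline {baseline:.2f})")
```

In this toy the threshold rule is optimal when the assumed cost equals the true cost, so any perturbation can only lower its net utility. The question the test answers is whether the loss is large enough to fall below the recall-target baseline.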
Original abstract
Deciding when to stop reviewing the results of a search is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target and do not take into account the reason for examining the results, potentially leading to sub-optimal recommendations. This paper applies decision theory to the problem and uses it to derive three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed approach generally produces more appropriate stopping decisions than existing methods, as demonstrated by higher net utility under the evaluated cost and payoff settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes three stopping policies for document screening in professional search tasks (patent examination and systematic reviews) derived from the Expected Value of Perfect Information (EVPI) within a decision-theoretic framework. These policies are contrasted with existing recall-target methods from Technology-Assisted Review (TAR). Experiments on the CLEF-IP and medical systematic review datasets are reported to show generally higher net utility for the EVPI policies under the evaluated cost and payoff settings.
Significance. If the central claim holds after addressing parameter sensitivity, the work would provide a principled alternative to heuristic recall targets by explicitly incorporating task-specific costs and payoffs into stopping decisions. The use of two distinct real-world collections (CLEF-IP and systematic-review data) supplies a concrete empirical test of the approach.
major comments (2)
- [Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.
- [Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.
minor comments (1)
- [Method section] Notation for the three EVPI policies could be introduced more explicitly with consistent symbols across the derivation and experimental tables.
Simulated Author's Rebuttal
We thank the referee for the constructive comments regarding parameter sensitivity. These observations correctly identify a gap in the current manuscript. We address each point below and commit to revisions that will strengthen the empirical support for the claims.
- Referee: [Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.
  Authors: We agree that the absence of sensitivity analysis limits the strength of the central claim. In the revised manuscript we will add a dedicated subsection to the Experiments section that systematically perturbs the cost and payoff parameters (e.g., multiplicative factors of 0.5, 0.8, 1.2, and 1.5) and reports the resulting net-utility differences and stopping decisions for the three EVPI policies versus the recall-target baselines on both CLEF-IP and the systematic-review collections. This will quantify the stability of the reported advantages.
  Revision: yes
- Referee: [Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.
  Authors: The referee is correct that the derivations treat the cost/payoff vector as fixed. We will expand the Method section with a short discussion of this modeling assumption and its implications. The primary mitigation will be the empirical sensitivity analysis described above, which directly examines how perturbations affect both thresholds and net utilities. If space allows, we will also note that the EVPI formulas are differentiable in the parameters, permitting straightforward first-order propagation analysis, though the empirical results will constitute the main addition.
  Revision: yes
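The first-order propagation mentioned in the second response can be sketched as follows. The notation is assumed here and does not come from the paper: $\phi$ is the cost/payoff vector, $\phi_0$ its nominal value, $\Delta\phi$ a small perturbation, and $V$ the quantity of interest (an EVPI value or a net utility).

```latex
\Delta V \;\approx\; \nabla_{\phi} V(\phi_0)^{\top} \Delta\phi
  \;=\; \sum_{j} \left.\frac{\partial V}{\partial \phi_j}\right|_{\phi_0} \Delta\phi_j .
```

Because EVPI is built from maxima over actions, an approximation of this kind can be expected to hold only away from parameter values at which the optimal action switches. This is one reason the empirical perturbation study is the more direct check.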
Circularity Check
No circularity; derivation from external decision theory and evaluation on independent datasets
The paper derives EVPI-based stopping policies from standard decision theory assuming known costs/payoffs as inputs, then evaluates the resulting policies on external collections (CLEF-IP, medical systematic reviews) against recall-target baselines using net utility computed under those same parameters. This is consistent application of the method rather than reduction by construction: the baselines are not derived from the same parameters, the datasets are independent, and no equations or self-citations reduce the central claim to a tautology or fitted input. No self-definitional steps, fitted predictions, or load-bearing self-citations are present in the provided text.
Axiom & Free-Parameter Ledger
free parameters (1)
- cost and payoff settings
axioms (1)
- domain assumption: Expected Value of Perfect Information provides a sound basis for deriving stopping policies in document screening