
arxiv: 2606.07071 · v1 · pith:QIBMHFSWnew · submitted 2026-06-05 · 💻 cs.IR

Decision-Theoretic Stopping Rules for Document Screening

Pith reviewed 2026-06-27 20:50 UTC · model grok-4.3

classification 💻 cs.IR
keywords stopping rules · decision theory · technology-assisted review · document screening · expected value of perfect information · patent search · systematic reviews · information retrieval

The pith

Decision theory yields EVPI-based stopping policies for document screening that achieve higher net utility than recall-targeted rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies decision theory to the common problem of deciding when to stop reviewing search results, instead of relying on fixed recall targets. It derives three practical stopping policies from the expected value of perfect information and tests them on patent examination and medical systematic review tasks. Experiments on CLEF-IP and medical systematic review datasets show that these policies generally produce higher net utility than recall-targeted rules under the evaluated cost and payoff settings. A sympathetic reader would care because existing methods ignore the specific reasons for screening and can lead to wasteful or incomplete review.

Core claim

Framing stopping as a decision problem under uncertainty allows derivation of EVPI policies that stop screening when the expected value of resolving uncertainty about the remaining documents falls below the cost of further review. On CLEF-IP and systematic review datasets, these policies generally yield higher net utility than existing TAR stopping rules across the evaluated cost-payoff settings.

What carries the argument

Expected Value of Perfect Information (EVPI) policies that quantify the benefit of knowing the true relevance status of unreviewed documents to decide whether continued screening is worthwhile.
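
As a rough illustration of the quantity involved, the sketch below computes EVPI for a finite set of states and actions using the textbook definition, EVPI = E[max_a U(a, s)] - max_a E[U(a, s)]. The two-state, two-action setup, every utility value, the per-document cost, and the names `evpi`, `state_probs`, `utilities`, and `review_cost` are hypothetical and chosen only for illustration; the paper's three policies are not reproduced here.

```python
from typing import Sequence


def evpi(state_probs: Sequence[float], utilities: Sequence[Sequence[float]]) -> float:
    """Expected value of perfect information for a finite decision problem.

    state_probs[s] is the probability of state s.
    utilities[a][s] is the utility of taking action a when the true state is s.
    """
    # With perfect information the best action is chosen separately in each state.
    with_info = sum(
        p * max(u[s] for u in utilities) for s, p in enumerate(state_probs)
    )
    # Without it, a single action must be chosen to maximise expected utility.
    without_info = max(
        sum(p * u[s] for s, p in enumerate(state_probs)) for u in utilities
    )
    return with_info - without_info


# Hypothetical toy numbers (assumptions, not taken from the paper).
state_probs = [0.1, 0.9]  # P(a relevant document remains), P(none remains)
utilities = [
    [-50.0, 10.0],  # action 0: proceed as if nothing relevant was missed
    [0.0, 0.0],     # action 1: a safe fallback with no gain and no loss
]
review_cost = 200 * 0.05  # 200 unreviewed documents at a cost of 0.05 each

value = evpi(state_probs, utilities)
should_stop = value < review_cost  # stop when information is worth less than it costs
print(value, should_stop)
```

With these toy numbers the EVPI is 5 while reviewing the remainder would cost 10, so the rule stops.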

If this is right

  • Stopping decisions become specific to the economic context of the search task rather than a single recall target.
  • In patent work the policies can reduce review volume while preserving the net value of found documents.
  • In systematic reviews the policies balance the cost of missing studies against review effort more directly than recall thresholds.
  • The same decision-theoretic framing can be reused for other professional search tasks that involve stopping under cost and payoff uncertainty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If costs and payoffs must be estimated rather than known exactly, the policies could be combined with sensitivity analysis to identify robust stopping points.
  • The approach might extend to dynamic settings where payoffs change as review progresses and new information arrives.
  • Larger-scale tests on streaming or multi-user search logs would show whether the utility advantage persists outside the two evaluated domains.

Load-bearing premise

The costs of reviewing a document and the payoffs for finding or missing a relevant one are known accurately enough in advance to compute the policies without substantial error.

What would settle it

Run the EVPI policies on the same datasets but with deliberately inaccurate cost or payoff values and check whether net utility falls below that of the recall-based baselines.
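
One way such a check could be set up is sketched below. A stopping rule is given a misstated review cost, and the stopping point it picks is then scored with the true cost and payoff. The rule used here is a simple myopic expected-gain threshold standing in for the paper's EVPI policies, which are not reproduced; the perturbation factors, toy scores, labels, and all function names are assumptions, and the comparison against recall-based baselines is left out.

```python
from typing import List, Sequence, Tuple


def stop_rank(probs: Sequence[float], cost: float, payoff: float) -> int:
    """Stand-in rule: review ranked documents until the expected payoff of the
    next one no longer covers the cost of reviewing it."""
    for k, p in enumerate(probs):
        if p * payoff < cost:
            return k
    return len(probs)


def net_utility(labels: Sequence[int], k: int, cost: float, payoff: float) -> float:
    """Payoff for each relevant document in the first k, minus the review cost."""
    return payoff * sum(labels[:k]) - cost * k


def misspecification_sweep(
    probs: Sequence[float],
    labels: Sequence[int],
    true_cost: float,
    true_payoff: float,
    factors: Sequence[float] = (0.5, 0.8, 1.0, 1.2, 1.5),
) -> List[Tuple[float, int, float]]:
    """For each factor, the rule sees cost * factor but is scored with the true cost."""
    rows = []
    for f in factors:
        k = stop_rank(probs, true_cost * f, true_payoff)
        rows.append((f, k, net_utility(labels, k, true_cost, true_payoff)))
    return rows


# Hypothetical ranked list: relevance probabilities and true labels.
probs = [0.9, 0.7, 0.5, 0.35, 0.22, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0]

rows = misspecification_sweep(probs, labels, true_cost=1.0, true_payoff=4.0)
for factor, k, utility in rows:
    print(f"cost x{factor}: stop after {k} documents, net utility {utility}")
```

Perturbing the payoff instead of the cost would follow the same pattern: only the value passed to the rule changes, never the values used for scoring.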

Figures

Figures reproduced from arXiv: 2606.07071 by Aaron H.A. Fletcher, Mark Stevenson.

Figure 1: Regret across cost regimes. Regret versus screening cost
Figure 2: Error type decomposition for systematic review screening [caption truncated]

Original abstract

Deciding when to stop reviewing the results of a search is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target and do not take into account the reason for examining the results, potentially leading to sub-optimal recommendations. This paper applies decision theory to the problem and uses it to derive three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed approach generally produces more appropriate stopping decisions than existing methods, as demonstrated by higher net utility under the evaluated cost and payoff settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes three stopping policies for document screening in professional search tasks (patent examination and systematic reviews) derived from the Expected Value of Perfect Information (EVPI) within a decision-theoretic framework. These policies are contrasted with existing recall-target methods from Technology-Assisted Review (TAR). Experiments on the CLEF-IP and medical systematic review datasets are reported to show generally higher net utility for the EVPI policies under the evaluated cost and payoff settings.

Significance. If the central claim holds after addressing parameter sensitivity, the work would provide a principled alternative to heuristic recall targets by explicitly incorporating task-specific costs and payoffs into stopping decisions. The use of two distinct real-world collections (CLEF-IP and systematic-review data) supplies a concrete empirical test of the approach.

major comments (2)
  1. [Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.
  2. [Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.
minor comments (1)
  1. [Method section] Notation for the three EVPI policies could be introduced more explicitly with consistent symbols across the derivation and experimental tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding parameter sensitivity. These observations correctly identify a gap in the current manuscript. We address each point below and commit to revisions that will strengthen the empirical support for the claims.

Point-by-point responses
  1. Referee: [Experiments section (CLEF-IP and systematic-review results)] The central claim that the EVPI policies produce higher net utility rests on treating the cost and payoff parameters as known exactly when both deriving the stopping thresholds and computing the reported utilities. No sensitivity analysis to perturbations in these parameters is presented, which directly affects threshold reliability and the magnitude of the reported advantage.

    Authors: We agree that the absence of sensitivity analysis limits the strength of the central claim. In the revised manuscript we will add a dedicated subsection to the Experiments section that systematically perturbs the cost and payoff parameters (e.g., multiplicative factors of 0.5, 0.8, 1.2, and 1.5) and reports the resulting net-utility differences and stopping decisions for the three EVPI policies versus the recall-target baselines on both CLEF-IP and the systematic-review collections. This will quantify the stability of the reported advantages.

    Revision: yes

  2. Referee: [Method section (EVPI policy derivations)] The EVPI derivation presupposes that the cost/payoff vector is known with sufficient accuracy for the value-of-information calculations to be stable; the manuscript provides no analysis of how uncertainty in these values propagates into the stopping decisions or net-utility differences.

    Authors: The referee is correct that the derivations treat the cost/payoff vector as fixed. We will expand the Method section with a short discussion of this modeling assumption and its implications. The primary mitigation will be the empirical sensitivity analysis described above, which directly examines how perturbations affect both thresholds and net utilities. If space allows, we will also note that the EVPI formulas are differentiable in the parameters, permitting straightforward first-order propagation analysis, though the empirical results will constitute the main addition.

    Revision: yes
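
The first-order propagation mentioned in the second response can be written out generically. The notation below is an assumption made for illustration and is not taken from the paper: \theta stands for the cost/payoff vector, s for the unknown relevance state of the unreviewed documents, and a for the available actions.

```latex
\mathrm{EVPI}(\theta)
  = \mathbb{E}_{s}\!\left[\max_{a} U(a, s; \theta)\right]
  - \max_{a} \mathbb{E}_{s}\!\left[U(a, s; \theta)\right]

\mathrm{EVPI}(\theta + \Delta\theta)
  \approx \mathrm{EVPI}(\theta)
  + \nabla_{\theta}\,\mathrm{EVPI}(\theta)^{\top} \Delta\theta
```

Because of the max operators, EVPI is in general only piecewise differentiable in \theta. The expansion is therefore reliable away from parameter values at which the optimal action switches, which is one reason an empirical perturbation study remains useful alongside it.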

Circularity Check

0 steps flagged

No circularity; derivation from external decision theory and evaluation on independent datasets


The paper derives EVPI-based stopping policies from standard decision theory, taking known costs and payoffs as inputs, and then evaluates the resulting policies on external collections (CLEF-IP, medical systematic reviews) against recall-target baselines, using net utility computed under those same parameters. This is a consistent application of the method rather than a reduction by construction: the baselines are not derived from the same parameters, the datasets are independent, and no equations or self-citations reduce the central claim to a tautology or a fitted input. No self-definitional steps, fitted predictions, or load-bearing self-citations are present in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that decision-theoretic EVPI can be computed and applied to stopping decisions, plus the practical availability of cost and payoff values for the evaluated tasks.

free parameters (1)
  • cost and payoff settings
    Net utility is computed under specific evaluated cost and payoff values that are not derived from first principles.
axioms (1)
  • domain assumption: Expected Value of Perfect Information provides a sound basis for deriving stopping policies in document screening
    Invoked to justify the three practical policies.


