
arxiv: 2604.25349 · v1 · submitted 2026-04-28 · 💻 cs.IR · stat.AP · stat.ME

Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

Pith reviewed 2026-05-07 15:23 UTC · model grok-4.3

classification 💻 cs.IR · stat.AP · stat.ME
keywords Wilcoxon signed-rank test · Information Retrieval evaluation · Type I error · statistical testing · non-parametric methods · benchmarking · significance testing

The pith

The Wilcoxon signed-rank test fails to control Type I error in typical IR evaluations and should be abandoned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that IR researchers treat the Wilcoxon signed-rank test as a reliable non-parametric option for comparing retrieval systems, based on the idea that metric scores are not normally distributed. This view traces to inconsistent textbook advice that has encouraged routine misapplication. Analysis and demonstrations reveal that the test loses control of false-positive rates under the conditions common in IR benchmarking. If correct, the field would gain more trustworthy significance claims by switching away from it.

Core claim

The central claim is that the Wilcoxon signed-rank test, when applied to paired IR metric differences, routinely exceeds its nominal Type I error rate because the discrete and bounded nature of the scores violates the conditions required for the test's asymptotic properties to hold. This problem is compounded by decades of guidance that presented Wilcoxon as a safe default without acknowledging its sensitivity in small-sample, tied-data settings typical of IR.
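
For reference, the quantities behind this claim can be written out. The notation below ($d_i$ for the per-topic score difference, $R_i$ for the rank of $|d_i|$, $W^{+}$ for the statistic) is introduced here for illustration and is not taken from the paper. The moments are the textbook ones and assume continuous differences that are symmetric about zero, with no ties or zeros.

```latex
% Signed-rank statistic over n paired differences d_1, ..., d_n
W^{+} = \sum_{i=1}^{n} R_i \, \mathbb{1}[d_i > 0]

% Null moments used by the normal approximation (no ties, no zeros)
\mathbb{E}[W^{+}] = \frac{n(n+1)}{4},
\qquad
\operatorname{Var}[W^{+}] = \frac{n(n+1)(2n+1)}{24}

% Type I error control at nominal level alpha
\Pr(\text{reject } H_0 \mid H_0 \text{ true}) \le \alpha
```

When the differences do not meet those assumptions, the reference distribution used to compute the p-value no longer matches the actual distribution of $W^{+}$, which is the sense in which the test can exceed its nominal level.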

What carries the argument

The Wilcoxon signed-rank test applied to differences in retrieval effectiveness metrics, whose discrete distributions and frequent ties cause the test statistic's distribution to deviate from the assumptions needed for proper p-value calibration.
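
To make the role of ties and zeros concrete, here is a minimal sketch of the signed-rank computation in plain Python. The choices in it are assumptions of this sketch and are not taken from the paper: zero differences are dropped, tied absolute differences share their average rank, the variance uses the usual tie correction, and the p-value comes from the normal approximation without a continuity correction. The function name `signed_rank_test` is hypothetical.

```python
import math


def signed_rank_test(diffs):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Sketch assumptions: zeros are dropped, ties get midranks, the variance
    is tie-corrected, and no continuity correction is applied.
    Returns (w_plus, z, p_value).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")

    # Sort positions by absolute difference, then walk over groups of ties.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]]):
            end += 1
        # Average of the 1-based ranks start+1 .. end+1.
        midrank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = midrank
        t = end - start + 1
        tie_term += t**3 - t
        start = end + 1

    # Sum of the ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, z, p_value


# Toy differences with one zero and one tie in absolute value.
w_plus, z, p_value = signed_rank_test([1, -1, 2, 0, 3])
print(w_plus, round(z, 3), round(p_value, 3))
```

Dropping zeros and averaging tied ranks are conventions rather than necessities, and different conventions can give different p-values on the same data, which is why the referee report below asks for the tie-handling rules to be stated.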

If this is right

  • Past IR papers that relied on Wilcoxon results would need re-examination for overstated significance.
  • New evaluation guidelines would recommend tests that maintain error control under the discrete score distributions found in practice.
  • Textbook presentations of non-parametric tests would require updates to clarify when Wilcoxon remains valid.
  • IR benchmarking workflows would shift toward alternatives that do not introduce uncontrolled false positives.
  • Training for researchers would emphasize checking the actual operating characteristics of a chosen test rather than its nominal category.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar misapplications may occur in other fields that compare bounded performance scores across small numbers of trials.
  • The episode illustrates how an initial statistical recommendation can persist long after its assumptions cease to hold in the target domain.
  • Adoption of corrected practices could reduce the rate of irreproducible findings in comparative retrieval studies.
  • Development of domain-specific simulation tools would help researchers verify test behavior before use.

Load-bearing premise

The test collections and evaluation protocols used for the demonstrations are representative of the broader range of IR experiments.

What would settle it

A broad set of IR experiments in which the observed rejection rate under the null hypothesis stays at or below the nominal alpha level across multiple metrics and system pairs would falsify the claim that the test routinely breaks down.
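
A check of this kind can be sketched as a small Monte Carlo experiment. Everything specific below is an assumption made for illustration and not the paper's protocol: the differences are drawn from a shifted exponential (mean exactly zero, but skewed) and not from TREC scores, there are 50 topics and 2000 repetitions, and the names `signed_rank_p`, `rejection_rate` and `wilson_interval` are hypothetical. Note that "null" here means a zero mean difference; under the test's own null of symmetry about zero, this data-generating process is not a null case at all.

```python
import math
import random


def signed_rank_p(diffs):
    """Two-sided signed-rank p-value, normal approximation, ties not expected."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return math.erfc(abs(w_plus - mean) / sd / math.sqrt(2))


def rejection_rate(n_topics=50, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated experiments in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Assumed null: skewed differences whose mean is exactly zero.
        diffs = [rng.expovariate(1.0) - 1.0 for _ in range(n_topics)]
        if signed_rank_p(diffs) < alpha:
            rejections += 1
    return rejections / reps


def wilson_interval(rate, reps, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    denom = 1 + z * z / reps
    centre = (rate + z * z / (2 * reps)) / denom
    half = z * math.sqrt(rate * (1 - rate) / reps + z * z / (4 * reps * reps)) / denom
    return centre - half, centre + half


rate = rejection_rate()
low, high = wilson_interval(rate, 2000)
print(f"rejection rate {rate:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the whole interval sits above the nominal 0.05, the test is not holding its level for that data-generating process; if the interval covers or sits below 0.05 across many realistic processes, that would count against the claim. Reporting the interval alongside the rate also addresses the request for variability measures on the Type I error figures.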

Figures

Figures reproduced from arXiv: 2604.25349 by Julián Urbano.

Figure 1: Effect of asymmetry, tail heaviness, discrete support and multimodality of …
Figure 2: Symmetry and tail heaviness observed in 𝑫 distributions from TREC data at 𝒏 = 50. For reference, dashed lines represent the sampling distributions expected if 𝑫 were normally distributed, and likewise dotted lines if they had tails as heavy as observed but still symmetric. The largest observed values are |𝜸| ≈ 7 and 𝜿 ≈ 45, but the axes are trimmed for clarity. treating heavier and lighter tails separatel…

Original abstract

In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that the Wilcoxon signed-rank test is routinely misused in IR evaluation due to misleading textbook portrayals of it as a safe non-parametric alternative to the t-test for non-normal metric scores. Through a literature review of statistical assumptions, analysis of violations in IR contexts, and empirical demonstrations on TREC data, the author shows that the test frequently loses control of its Type I error rate, and concludes that its continued use is unjustified and that it should be abandoned to improve methodological soundness.

Significance. If the central empirical claim holds, this would be a significant contribution to IR methodology, directly challenging a practice recommended in many textbooks and used across numerous papers. The combination of textbook analysis with concrete TREC-based simulations that measure Type I error inflation provides reproducible evidence on public data, strengthening the case for re-evaluating standard practices. The result, if generalizable, could lead to more reliable statistical comparisons in system benchmarking.

major comments (2)
  1. [Empirical section (simulations)] The empirical demonstrations (likely §4 or equivalent) show Type I error inflation on TREC collections, but the simulation protocol details—such as exact null distribution sampling, tie-breaking rules for discrete metrics like AP, and number of repetitions—are insufficient to fully assess robustness. This is load-bearing for the claim that Wilcoxon 'virtually guarantees' breakdown.
  2. [Conclusions] The strong conclusion to abandon Wilcoxon field-wide rests on TREC setups with fixed ~50 topics and bounded discrete metrics that produce ties/near-zero differences. The paper should test or explicitly discuss whether the observed inflation generalizes to other common IR scenarios (e.g., larger topic sets or continuous metrics) to support the broad recommendation.
minor comments (2)
  1. [Literature review] The literature review could benefit from more precise page or section citations when pointing out inconsistencies in specific statistics textbooks.
  2. [Figures] Figures showing Type I error rates would be clearer with added confidence intervals or variability measures across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the potential significance of our work. We will revise the manuscript to address the concerns raised regarding the empirical simulations and the generalizability of our conclusions.

Point-by-point responses
  1. Referee: [Empirical section (simulations)] The empirical demonstrations (likely §4 or equivalent) show Type I error inflation on TREC collections, but the simulation protocol details—such as exact null distribution sampling, tie-breaking rules for discrete metrics like AP, and number of repetitions—are insufficient to fully assess robustness. This is load-bearing for the claim that Wilcoxon 'virtually guarantees' breakdown.

    Authors: We agree that additional details on the simulation protocol are necessary to allow full assessment of our empirical claims. In the revised manuscript, we will expand the description in the empirical section to include precise information on how the null distribution is sampled, the specific tie-breaking rules applied for discrete metrics such as AP, and the number of repetitions used in the simulations. This will strengthen the reproducibility and robustness evaluation of the Type I error results. revision: yes

  2. Referee: [Conclusions] The strong conclusion to abandon Wilcoxon field-wide rests on TREC setups with fixed ~50 topics and bounded discrete metrics that produce ties/near-zero differences. The paper should test or explicitly discuss whether the observed inflation generalizes to other common IR scenarios (e.g., larger topic sets or continuous metrics) to support the broad recommendation.

    Authors: We acknowledge that our primary empirical evidence comes from TREC collections with approximately 50 topics and discrete metrics. To support the recommendation more broadly, we will add an explicit discussion in the conclusions section addressing the generalizability to other scenarios, such as larger numbers of topics or continuous metrics. We will explain why the identified problems (ties and small differences) are likely to persist in many IR evaluation settings but will also note the limitations of our current experiments and suggest avenues for future validation. This will temper the conclusion appropriately while maintaining the core argument based on the evidence presented. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper builds its case via external statistical theory from textbooks, a systematic literature review of IR practices, and fresh empirical simulations on public TREC collections using standard metrics like AP and nDCG. These steps rely on independent data and established assumptions rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. The central claim that Wilcoxon loses Type I error control in IR settings does not reduce to the paper's own inputs by construction; the TREC-based demonstrations are reproducible external benchmarks. Minor self-citations to prior IR stats work are present but not load-bearing for the main argument.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical assumptions about the Wilcoxon test's requirements for independence and continuous distributions, plus the representativeness of TREC topic-level scores. No new entities are postulated and no parameters are fitted to produce the main result.

axioms (2)
  • [standard math] The Wilcoxon signed-rank test requires independent observations and continuous underlying distributions for its Type I error guarantee to hold exactly.
    Invoked when explaining why the test breaks down in IR settings with topic-level dependence and ties.
  • [domain assumption] TREC-style evaluation collections produce score distributions and dependence structures that are typical of IR benchmarking practice.
    Used to generalize the empirical demonstrations to the broader field.

pith-pipeline@v0.9.0 · 5475 in / 1416 out tokens · 56133 ms · 2026-05-07T15:23:01.853986+00:00 · methodology


Venue, per the paper's running header: SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia.