Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

Julián Urbano

arxiv: 2604.25349 · v1 · submitted 2026-04-28 · 💻 cs.IR · stat.AP · stat.ME

Pith reviewed 2026-05-07 15:23 UTC · model grok-4.3

classification 💻 cs.IR · stat.AP · stat.ME

keywords Wilcoxon signed-rank test · Information Retrieval evaluation · Type I error · statistical testing · non-parametric methods · benchmarking · significance testing

The pith

The Wilcoxon signed-rank test fails to control Type I error in typical IR evaluations and should be abandoned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that IR researchers treat the Wilcoxon signed-rank test as a reliable non-parametric option for comparing retrieval systems, based on the idea that metric scores are not normally distributed. This view traces to inconsistent textbook advice that has encouraged routine misapplication. Analysis and demonstrations reveal that the test loses control of false-positive rates under the conditions common in IR benchmarking. If correct, the field would gain more trustworthy significance claims by switching away from it.

Core claim

The central claim is that the Wilcoxon signed-rank test, when applied to paired IR metric differences, routinely exceeds its nominal Type I error rate because the discrete and bounded nature of the scores violates the conditions required for the test's asymptotic properties to hold. This problem is compounded by decades of guidance that presented Wilcoxon as a safe default without acknowledging its sensitivity in small-sample, tied-data settings typical of IR.

What carries the argument

The Wilcoxon signed-rank test applied to differences in retrieval effectiveness metrics, whose discrete distributions and frequent ties cause the test statistic's distribution to deviate from the assumptions needed for proper p-value calibration.

If this is right

Past IR papers that relied on Wilcoxon results would need re-examination for overstated significance.
New evaluation guidelines would recommend tests that maintain error control under the discrete score distributions found in practice.
Textbook presentations of non-parametric tests would require updates to clarify when Wilcoxon remains valid.
IR benchmarking workflows would shift toward alternatives that do not introduce uncontrolled false positives.
Training for researchers would emphasize checking the actual operating characteristics of a chosen test rather than its nominal category.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar misapplications may occur in other fields that compare bounded performance scores across small numbers of trials.
The episode illustrates how an initial statistical recommendation can persist long after its assumptions cease to hold in the target domain.
Adoption of corrected practices could reduce the rate of irreproducible findings in comparative retrieval studies.
Development of domain-specific simulation tools would help researchers verify test behavior before use.

Load-bearing premise

The test collections and evaluation protocols used for the demonstrations are representative of the broader range of IR experiments.

What would settle it

A broad set of IR experiments in which the observed rejection rate under the null hypothesis stays at or below the nominal alpha level across multiple metrics and system pairs would falsify the claim that the test routinely breaks down.
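
A minimal sketch of how such a rejection rate can be measured, assuming the null hypothesis of interest is a zero mean difference. This is not the paper's protocol: the two samplers (a normal distribution and a centered exponential), n = 50, the 2,000 repetitions, and every function name are illustrative assumptions. The p-value uses the large-sample normal approximation, with zeros dropped and midranks for ties.

```python
import math
import random


def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_p(diffs):
    """Two-sided signed-rank p-value (normal approximation, zeros dropped)."""
    d = [x for x in diffs if x != 0.0]
    n = len(d)
    if n == 0:
        return 1.0
    ranks = midranks([abs(x) for x in d])
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    var = sum(r * r for r in ranks) / 4  # n(n+1)(2n+1)/24 when there are no ties
    z = (w_plus - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(sampler, n=50, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated samples on which the test rejects at level alpha."""
    rng = random.Random(seed)
    hits = sum(
        wilcoxon_p([sampler(rng) for _ in range(n)]) < alpha for _ in range(reps)
    )
    return hits / reps


def symmetric(rng):
    """Assumed null case: mean 0 and symmetric about 0."""
    return rng.gauss(0.0, 1.0)


def skewed(rng):
    """Assumed null case: mean 0 but asymmetric (centered exponential)."""
    return rng.expovariate(1.0) - 1.0


if __name__ == "__main__":
    print("symmetric:", rejection_rate(symmetric))
    print("skewed:   ", rejection_rate(skewed))
```

Under these assumptions, the symmetric sampler should reject close to the nominal 5%, while the skewed one should reject noticeably more often even though both have mean zero. Replacing the samplers with resampled per-topic differences from a real collection would bring the sketch closer to the kind of experiment described above.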

Figures

Figures reproduced from arXiv: 2604.25349 by Julián Urbano.

**Figure 1.** Effect of asymmetry, tail heaviness, discrete support and multimodality of …

**Figure 2.** Symmetry and tail heaviness observed in 𝑫 distributions from TREC data at 𝒏 = 50. For reference, dashed lines represent the sampling distributions expected if 𝑫 were normally distributed, and likewise dotted lines if they had tails as heavy as observed but still symmetric. The largest observed values are |𝜸| ≈ 7 and 𝜿 ≈ 45, but the axes are trimmed for clarity. … treating heavier and lighter tails separatel…
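
The shape statistics in the Figure 2 caption can be illustrated with the standard moment-based estimators. This sketch assumes 𝜸 denotes skewness and 𝜿 denotes kurtosis (not excess kurtosis, so a normal distribution has a value of 3). The paper's exact estimators may differ, and the function name is hypothetical.

```python
def skewness_kurtosis(diffs):
    """Moment-based sample skewness and kurtosis of per-topic differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    m2 = sum((x - mean) ** 2 for x in diffs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in diffs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in diffs) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("all differences are identical; shape is undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    # Hypothetical differences: mostly zero with one large positive value.
    print(skewness_kurtosis([0.0, 0.0, 0.0, 0.0, 10.0]))
```

A skewness far from zero signals the kind of asymmetry that matters for the signed-rank test, which is why the two statistics are plotted together.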

Original abstract

In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines a textbook review with TREC-based simulations to show that the Wilcoxon signed-rank test often inflates Type I error in IR comparisons, but its call to drop the test everywhere depends on how typical those TREC traits are.

This paper documents how the Wilcoxon signed-rank test frequently exceeds its nominal Type I error rate when applied to standard IR system comparisons on TREC data, and how Statistics textbooks have given inconsistent guidance on its assumptions for decades. Together, these explain the routine misuse the author describes.

The textbook review is a clear addition because it traces how the non-parametric justification was simplified in ways that ignore the symmetry and continuity requirements. The empirical part runs controlled simulations on public TREC collections and counts actual false positives, which yields concrete numbers rather than theory alone. Those runs are the new evidence here, and they support the claim that the test breaks down under the small topic sets, bounded discrete metrics, and frequent ties common in IR. Because the simulations use real, public collections, the results can be checked.

The soft spot is the step from these TREC-specific conditions to a field-wide recommendation to stop using the test. TREC setups often have exactly the features that stress the test's ranking and null assumptions, such as fixed topic counts near 50 and score distributions with many near-zero differences. If those features drive the error inflation, the problem may be weaker in other IR tasks with larger topic sets or different metrics. Testing a few more varied collections would have tightened that link.

This work is for IR researchers who run significance tests on ranked system outputs and for anyone teaching evaluation methodology. Readers who already worry about statistical practice in benchmarking will get direct value from the error-rate measurements. It deserves a serious referee because the empirical demonstrations are grounded in public data and the textbook analysis is independent of the simulations. I would send it for review after the author clarifies the scope of the generalization.

Referee Report

2 major / 2 minor

Summary. The paper argues that the Wilcoxon signed-rank test is routinely misused in IR evaluation because textbooks misleadingly portray it as a safe non-parametric alternative to the t-test for non-normal metric scores. Through a literature review of statistical assumptions, an analysis of how IR data violate them, and empirical simulations on TREC data, the author demonstrates that the test frequently loses control of its Type I error rate, and concludes that its continued use is unjustified and that abandoning it would improve methodological soundness.

Significance. If the central empirical claim holds, this would be a significant contribution to IR methodology, directly challenging a practice recommended in many textbooks and used across numerous papers. The combination of textbook analysis with concrete TREC-based simulations that measure Type I error inflation provides reproducible evidence on public data, strengthening the case for re-evaluating standard practices. The result, if generalizable, could lead to more reliable statistical comparisons in system benchmarking.

major comments (2)

[Empirical section (simulations)] The empirical demonstrations (likely §4 or equivalent) show Type I error inflation on TREC collections, but the simulation protocol details—such as exact null distribution sampling, tie-breaking rules for discrete metrics like AP, and number of repetitions—are insufficient to fully assess robustness. This is load-bearing for the claim that Wilcoxon 'virtually guarantees' breakdown.
[Conclusions] The strong conclusion to abandon Wilcoxon field-wide rests on TREC setups with fixed ~50 topics and bounded discrete metrics that produce ties/near-zero differences. The paper should test or explicitly discuss whether the observed inflation generalizes to other common IR scenarios (e.g., larger topic sets or continuous metrics) to support the broad recommendation.
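
To make the tie-handling point concrete, here is a small sketch of two common conventions for zero differences in the signed-rank statistic: the original one, which drops zeros before ranking, and Pratt's, which ranks zeros together with the other differences and then discards their ranks. The function names and the example scores are illustrative assumptions, not the paper's protocol. The option labels "wilcox" and "pratt" mirror those used by SciPy's `zero_method` argument.

```python
def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_rank_sums(diffs, zero_method="wilcox"):
    """Return (W+, W-) under the chosen convention for zero differences."""
    if zero_method == "wilcox":
        kept = [d for d in diffs if d != 0]  # drop zeros, then rank
    elif zero_method == "pratt":
        kept = list(diffs)  # rank zeros too; their ranks are never summed
    else:
        raise ValueError("zero_method must be 'wilcox' or 'pratt'")
    ranks = midranks([abs(d) for d in kept])
    w_plus = sum(r for r, d in zip(ranks, kept) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, kept) if d < 0)
    return w_plus, w_minus


if __name__ == "__main__":
    # Hypothetical per-topic differences with two zeros and one tie.
    example = [0.0, 0.1, -0.1, 0.2, 0.0, 0.3]
    print(signed_rank_sums(example, "wilcox"))
    print(signed_rank_sums(example, "pratt"))
```

The same data yield different rank sums under the two conventions, so a simulation report that omits the convention cannot be reproduced exactly.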

minor comments (2)

[Literature review] The literature review could benefit from more precise page or section citations when pointing out inconsistencies in specific statistics textbooks.
[Figures] Figures showing Type I error rates would be clearer with added confidence intervals or variability measures across runs.
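
One way to attach the requested variability measure to an estimated Type I error rate is a binomial confidence interval over the simulation runs. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the paper specifies. The function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(rejections, runs, level=0.95):
    """Wilson score interval for a rejection rate estimated from `runs` trials."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    p = rejections / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical outcome: 60 rejections in 1,000 simulated null comparisons.
    print(wilson_interval(60, 1000))
```

If the interval for a configuration excludes the nominal alpha, the deviation is unlikely to be simulation noise alone.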

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the potential significance of our work. We will revise the manuscript to address the concerns raised regarding the empirical simulations and the generalizability of our conclusions.

Referee: [Empirical section (simulations)] The empirical demonstrations (likely §4 or equivalent) show Type I error inflation on TREC collections, but the simulation protocol details—such as exact null distribution sampling, tie-breaking rules for discrete metrics like AP, and number of repetitions—are insufficient to fully assess robustness. This is load-bearing for the claim that Wilcoxon 'virtually guarantees' breakdown.

Authors: We agree that additional details on the simulation protocol are necessary to allow full assessment of our empirical claims. In the revised manuscript, we will expand the description in the empirical section to include precise information on how the null distribution is sampled, the specific tie-breaking rules applied for discrete metrics such as AP, and the number of repetitions used in the simulations. This will strengthen the reproducibility and robustness evaluation of the Type I error results.
revision: yes

Referee: [Conclusions] The strong conclusion to abandon Wilcoxon field-wide rests on TREC setups with fixed ~50 topics and bounded discrete metrics that produce ties/near-zero differences. The paper should test or explicitly discuss whether the observed inflation generalizes to other common IR scenarios (e.g., larger topic sets or continuous metrics) to support the broad recommendation.

Authors: We acknowledge that our primary empirical evidence comes from TREC collections with approximately 50 topics and discrete metrics. To support the recommendation more broadly, we will add an explicit discussion in the conclusions section addressing the generalizability to other scenarios, such as larger numbers of topics or continuous metrics. We will explain why the identified problems (ties and small differences) are likely to persist in many IR evaluation settings but will also note the limitations of our current experiments and suggest avenues for future validation. This will temper the conclusion appropriately while maintaining the core argument based on the evidence presented.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper builds its case via external statistical theory from textbooks, a systematic literature review of IR practices, and fresh empirical simulations on public TREC collections using standard metrics like AP and nDCG. These steps rely on independent data and established assumptions rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. The central claim that Wilcoxon loses Type I error control in IR settings does not reduce to the paper's own inputs by construction; the TREC-based demonstrations are reproducible external benchmarks. Minor self-citations to prior IR stats work are present but not load-bearing for the main argument.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical assumptions about the Wilcoxon signed-rank test's requirements (independent observations from a continuous distribution that is symmetric under the null), plus the representativeness of TREC topic-level scores. No new entities are postulated and no parameters are fitted to produce the main result.

axioms (2)

standard math: The Wilcoxon signed-rank test requires independent observations from a continuous distribution that is symmetric about the hypothesized center for its Type I error guarantee to hold exactly.
Invoked when explaining why the test breaks down in IR settings with topic-level dependence and ties.
domain assumption: TREC-style evaluation collections produce score distributions and dependence structures that are typical of IR benchmarking practice.
Used to generalize the empirical demonstrations to the broader field.

pith-pipeline@v0.9.0 · 5475 in / 1416 out tokens · 56133 ms · 2026-05-07T15:23:01.853986+00:00 · methodology

Reference graph

Works this paper leans on

84 extracted references · 53 canonical work pages

[1]

Anderson, Dennis J

David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D. Camm, James J. Cochran, Michael J. Fry, and Jeffrey W. Ohlmann. 2019.Statistics for Business and Economics(14 ed.). Cengage Learning

2019
[2]

Coups, and Elaine N

Arthur Aron, Elliot J. Coups, and Elaine N. Aron. 2013.Statistics for Psychology (6 ed.). Pearson. 744 pages

2013
[3]

2014.Probability and Statistics for Computer Scientists(2 ed.)

Michael Baron. 2014.Probability and Statistics for Computer Scientists(2 ed.). CRC Press. 473 pages

2014
[4]

Johnson, Shahin Hashtroudi, and Stephen L

R. Clifford Blair and James J. Higgins. 1985. Comparison of the Power of the Paired Samples t test to that of Wilcoxon’s Signed-ranks Test Under Various Population Ahapes.Psychological Bulletin97, 1 (1985), 119–128. doi:10.1037/0033- 2909.97.1.119

work page doi:10.1037/0033- 1985
[5]

David Bodoff. 2008. Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and Management44, 3 (2008), 1117–1145. doi:10.1016/j. ipm.2007.11.006

work page doi:10.1016/j 2008
[6]

Alan Boneau

C. Alan Boneau. 1960. The Effects of Violations of Assumptions Underlying the t test.Psychological Bulletin57, 1 (1960), 49–64. doi:10.1037/h0041412

work page doi:10.1037/h0041412 1960
[7]

George E. P. Box. 1953. Non-Normality and Tests on Variances.Biometrika40, 3/4 (1953), 318. doi:10.2307/2333350

work page doi:10.2307/2333350 1953
[8]

George E. P. Box, J. Stuart Hunter, and William G. Hunter. 2005.Statistics for Experimenters: Design, Innovation and Discovery(2 ed.). Wiley

2005
[9]

Ben Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments.ACM Transactions on Information Systems 30, 1 (2012). doi:10.1145/2094072.2094076

work page doi:10.1145/2094072.2094076 2012
[10]

Ben Carterette. 2015. Bayesian Inference for Information Retrieval Evaluation. InInternational Conference on the Theory of Information Retrieval. 31–40. doi:10. 1145/2808194.2809469

work page arXiv 2015
[11]

Ben Carterette. 2017. But Is It Statistically Significant?. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 1125–1128. doi:10.1145/3077136.3080738

work page doi:10.1145/3077136.3080738 2017
[12]

Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline Evaluation Options for Recommender Systems.Information Retrieval Journal23, 4 (2020), 387–410. doi:10.1007/s10791-020-09371-3

work page doi:10.1007/s10791-020-09371-3 2020
[13]

Chaffin and Steven G

Wilkie W. Chaffin and Steven G. Rhiel. 1993. The Effect of Skewness and Kurtosis on the One-Sample T Test and the Impact of Knowledge of the Population Standard Deviation.Journal of Statistical Computation and Simulation46, 1-2 (1993), 79–90. doi:10.1080/00949659308811494

work page doi:10.1080/00949659308811494 1993
[14]

Cicchitelli

G. Cicchitelli. 1989. On the Robustness of the One-Sample t Test.Journal of Statistical Computation and Simulation32, 4 (1989), 249–258. doi:10.1080/ 00949658908811181

1989
[15]

2018.Research Methods in Education(8 ed.)

Louis Cohen, Lawrence Manion, and Keith Morrison. 2018.Research Methods in Education(8 ed.). Routledge

2018
[16]

William Jay Conover. 1973. On Methods of Handling Ties in the Wilcoxon Signed-Rank Test.J. Amer. Statist. Assoc.68, 344 (1973), 985–988. doi:10.1080/ 01621459.1973.10481460

work page arXiv 1973
[17]

William Jay Conover. 1973. Rank Tests for One Sample, Two Samples, and k Samples Without the Assumption of a Continuous Distribution Function.The Annals of Statistics1, 6 (1973), 1105–1125

1973
[18]

W. J. Conover. 1999.Practical Nonparametric Statistics(3 ed.). Wiley. doi:10.2307/ 1271101

1999
[19]

Clyde H. Coombs. 1950. Psychological Scaling Without a Unit of Measurement. Psychological Review57, 3 (1950), 145–158. doi:10.1037/h0060984

work page doi:10.1037/h0060984 1950
[20]

Cormack and Thomas R

Gordon V. Cormack and Thomas R. Lynam. 2007. Validity and Power of t-test for Comparing MAP and GMAP. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 753–754. doi:10.1145/1277741.1277892

work page doi:10.1145/1277741.1277892 2007
[21]

F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester. 2005.A Modern Introduction to Probability and Statistics: Understanding Why and How(1 ed.). Springer

2005
[22]

Guglielmo Faggioli, Nicola Ferro, and Norbert Fuhr. 2022. Detecting Significant Differences Between Information Retrieval Systems via Generalized Linear Mod- els. InACM International Conference on Information and Knowledge Management. 446–456. doi:10.1145/3511808.3557286

work page doi:10.1145/3511808.3557286 2022
[23]

Carmen Fernández and Mark F. J. Steel. 1998. On Bayesian Modeling of Fat Tails and Skewness.J. Amer. Statist. Assoc.93, 441 (1998), 359–371. doi:10.2307/2669632

work page doi:10.2307/2669632 1998
[24]

Marco Ferrante, Nicola Ferro, and Norbert Fuhr. 2021. Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales. IEEE Access9 (2021), 136182–136216. doi:10.1109/access.2021.3116857

work page doi:10.1109/access.2021.3116857 2021
[25]

Nicola Ferro, Yubin Kim, and Mark Sanderson. 2019. Using Collection Shards to Study Retrieval Performance Effect Sizes.ACM Transactions on Information Systems37, 3 (2019). doi:10.1145/3310364

work page doi:10.1145/3310364 2019
[26]

Nicola Ferro and Gianmaria Silvello. 2016. A General Linear Mixed Mod- els Approach to Study System Component Effects. InInternational ACM SI- GIR conference on Research and Development in Information Retrieval. 25–34. doi:10.1145/2911451.2911530

work page doi:10.1145/2911451.2911530 2016
[27]

Norbert Fuhr. 2018. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided.ACM SIGIR Forum51, 3 (2018), 32–41. doi:10.1145/3190580.3190586 Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia

work page doi:10.1145/3190580.3190586 2018
[28]

Frank E. Harrell. 2015.Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis(2 ed.). Springer

2015
[29]

J. A. Hartigan and P. M. Hartigan. 1985. The Dip Test of Unimodality.The Annals of Statistics13, 1 (1985), 70–84. doi:10.1214/aos/1176346577

work page doi:10.1214/aos/1176346577 1985
[30]

Headrick, Rhonda K

Todd C. Headrick, Rhonda K. Kowalchuk, and Yanyan Sheng. 2008. Parametric Probability Densities and Distribution Functions for Tukey g-and-h Transforma- tions and Their Use for Fitting Data.Applied Mathematical Sciences2, 9 (2008), 449–462

2008
[31]

Hettmansperger

Thomas P. Hettmansperger. 1984.Statistical Inference Based on Ranks. Wiley

1984
[32]

Myles Hollander and Douglas A. Wolfe. 1973.Nonparametric Statistical Methods (1 ed.). Wiley

1973
[33]

David Hull. 1993. Using Statistical Testing in the Evaluation of Retrieval Experi- ments. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 329–338. doi:10.1145/160688.160758

work page doi:10.1145/160688.160758 1993
[34]

Gopal K. Kanji. 2006.100 Statistical Tests(3 ed.). Sage

2006
[35]

E. L. Lehmann and Joseph P. Romano. 2022.Testing Statistical Hypotheses(4 ed.). Springer

2022
[36]

Heng Li and Terri Johnson. 2014. Wilcoxon’s Signed-rank Statistic: What Null Hypothesis and Why it Matters.Pharmaceutical Statistics13, 5 (2014), 281–285. doi:10.1002/pst.1628

work page doi:10.1002/pst.1628 2014
[37]

Thomas Lumley, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The Importance of the Normality Assumption in Large Public Health Data Sets.Annual Review of Public Health23, 1 (2002), 151–169. doi:10.1146/annurev.publhealth.23.100901. 140546

work page doi:10.1146/annurev.publhealth.23.100901 2002
[38]

Bryan F. J. Manly. 2008.Statistics for Environmental Science and Management(2 ed.). CRC Press

2008
[39]

Beaver, and Robert J

William Mendenhall, Barbara M. Beaver, and Robert J. Beaver. 2018.Introduction to Probability and Statistics(15 ed.). Cengage

2018
[40]

Theodore Micceri. 1989. The Unicorn, the Normal Curve, and Other Improbable Creatures.Psychological Bulletin105, 1 (1989), 156–166. doi:10.1037/0033-2909. 105.1.156

work page doi:10.1037/0033-2909 1989
[41]

Joel Michell. 1986. Measurement Scales and Statistics: A Clash of Paradigms. Psychological Bulletin100, 3 (1986), 398–407. doi:10.1037/0033-2909.100.3.398

work page doi:10.1037/0033-2909.100.3.398 1986
[42]

Alistair Moffat. 2022. Batch Evaluation Metrics in Information Retrieval: Mea- sures, Scales, and Meaning.IEEE Access10 (2022), 105564–105577. doi:10.1109/ access.2022.3211668

work page arXiv 2022
[43]

Montgomery and George C

Douglas C. Montgomery and George C. Runger. 2014.Applied Statistics and Probability for Engineers(6 ed.). Wiley

2014
[44]

Saralees Nadarajah. 2005. A Generalized Normal Distribution.Journal of Applied Statistics32, 7 (2005), 685–694. doi:10.1080/02664760500079464

work page doi:10.1080/02664760500079464 2005
[45]

2012.Nonparametric Statistical Tests: A Computational Ap- proach

Markus Neuhäuser. 2012.Nonparametric Statistical Tests: A Computational Ap- proach. CRC Press

2012
[46]

Lyman Ott and Michael Longnecker

R. Lyman Ott and Michael Longnecker. 2015.An Introduction to Statistical Methods and Data Analysis(7 ed.). Cengage

2015
[47]

Losada, and Álvaro Barreiro

Javier Parapar, David E. Losada, and Álvaro Barreiro. 2021. Testing the Tests: Simulation of Rankings to Compare Statistical Significance Tests in Information Retrieval Evaluation. InACM/SIGAPP Symposium on Applied Computing. 655–664. doi:10.1145/3412841.3441945

work page doi:10.1145/3412841.3441945 2021
[48]

Losada, Manuel A

Javier Parapar, David E. Losada, Manuel A. Presedo Quindimil, and Álvaro Bar- reiro. 2020. Using Score Distributions to Compare Statistical Significance Tests for Information Retrieval Evaluation.Journal of the Association for Information Science and Technology71, 1 (2020), 98–113. doi:10.1002/asi.24203

work page doi:10.1002/asi.24203 2020
[49]

John W. Pratt. 1959. Remarks on Zeros and Ties in the Wilcoxon Signed Rank Procedures.J. Amer. Statist. Assoc.54, 287 (1959), 655–667. doi:10.1080/01621459. 1959.10501526

work page doi:10.1080/01621459 1959
[50]

Pratt and Jean D

John W. Pratt and Jean D. Gibbons. 1981.Concepts of Nonparametric Theory(1 ed.). Springer

1981
[51]

Privitera

Gregory J. Privitera. 2014.Statistics for the Behavioral Sciences(3 ed.). Sage

2014
[52]

John A. Rice. 2007.Mathematical Statistics and Data Analysis(3 ed.). Duxbury. 650 pages

2007
[53]

Tetsuya Sakai. 2006. Evaluating Evaluation Metrics Based on the Bootstrap. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 525–532. doi:10.1145/1148170.1148261

work page doi:10.1145/1148170.1148261 2006
[54]

Tetsuya Sakai. 2016. Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015. InInternational ACM SI- GIR Conference on Research and Development in Information Retrieval. 5–14. doi:10.1145/2911451.2911492

work page doi:10.1145/2911451.2911492 2016
[55]

Tetsuya Sakai. 2016. Two Sample T-tests for IR Evaluation: Student or Welch?. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 1045–1048. doi:10.1145/2911451.2914684

work page doi:10.1145/2911451.2914684 2016
[56]

Tetsuya Sakai. 2020. On Fuhr’s Guideline for IR Evaluation.ACM SIGIR Forum 54, 1 (2020), 1–8. doi:10.1145/3451964.3451976

work page doi:10.1145/3451964.3451976 2020
[57]

Mark Sanderson and Justin Zobel. 2005. Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 162–169. doi:10.1145/1076034. 1076064

work page doi:10.1145/1076034 2005
[58]

Jacques Savoy. 1997. Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and Management33, 4 (1997), 495–512. doi:10.1016/s0306- 4573(97)00027-7

work page doi:10.1016/s0306- 1997
[59]

Sawilowsky and R

Shlomo S. Sawilowsky and R. Clifford Blair. 1992. A More Realistic Look at the Robustness and Type II Error Properties of the t Test to Departures from Population Normality.Quantitative Methods in Psychology111, 2 (1992), 352–360. doi:10.1037//0033-2909.111.2.352

work page doi:10.1037//0033-2909.111.2.352 1992
[60]

David J. Sheskin. 2000.Handbook of Parametric and Nonparametric Statistical Procedures(2 ed.). Chapman & Hall

2000
[61]

1956.Nonparametric Statistics for the Behavioral Sciences(1 ed.)

Sidney Siegel. 1956.Nonparametric Statistics for the Behavioral Sciences(1 ed.). McGraw Hill

1956
[62]

B. W. Silverman. 1981. Using Kernel Density Estimates to Investigate Multimodal- ity.Journal of the Royal Statistical Society Series B: Statistical Methodology43, 1 (1981), 97–99. doi:10.1111/j.2517-6161.1981.tb01155.x

work page doi:10.1111/j.2517-6161.1981.tb01155.x 1981
[63]

Smucker, James Allan, and Ben Carterette

Mark D. Smucker, James Allan, and Ben Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. InACM In- ternational Conference on Information and Knowledge Management. 623–632. doi:10.1145/1321440.1321528

work page doi:10.1145/1321440.1321528 2007
[64]

Smucker, James Allan, and Ben Carterette

Mark D. Smucker, James Allan, and Ben Carterette. 2009. Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 630–631. doi:10.1145/1571941.1572050

work page doi:10.1145/1571941.1572050 2009
[65]

Sokal and F

Robert R. Sokal and F. James Rohlf. 1995.Biometry: The Principles and Practice of Statistics in Biological Research(3 ed.). W.H. Freeman

1995
[66]

Student. 1908. The Probable Error of a Mean.Biometrika6, 1 (1908), 1–25. doi:10.2307/2331554

work page doi:10.2307/2331554 1908
[67]

M. Th. Subbotin. 1923. On the Law of Frequency of Error.Matematicheskii Sbornik31, 2 (1923), 296–301

1923
[68]

Jean Tague-Sutcliffe. 1992. The Pragmatics of Information Retrieval Experimenta- tion, Revisited.Information Processing and Management28, 4 (jul 1992), 467–490. doi:10.1016/0306-4573(92)90005-k

work page doi:10.1016/0306-4573(92)90005-k 1992
[69]

John W. Tukey. 1977. Modern Techniques in Data Analysis. InNSF-sponsored regionalresearch conference at Southern Massachusetts University

1977
[70]

Julián Urbano, Matteo Corsi, and Alan Hanjalic. 2021. How do Metric Score Distributions Affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?. InACM SIGIR International Conference on the Theory of Information Retrieval. 245–250. doi:10.1145/3471158.3472242

work page doi:10.1145/3471158.3472242 2021
[71]

Julián Urbano, Harlley Lima, and Alan Hanjalic. 2019. Statistical Significance Test- ing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 505–514. doi:10.1145/3331184.3331259

work page doi:10.1145/3331184.3331259 2019
[72]

Julián Urbano, Mónica Marrero, and Diego Martín. 2013. A Comparison of the Op- timality of Statistical Significance Tests for Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval. 925–928. doi:10.1145/2484028.2484163

work page doi:10.1145/2484028.2484163 2013
[73]

Julián Urbano and Thomas Nagler. 2018. Stochastic Simulation of Test Collec- tions: Evaluation Scores. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 695–704. doi:10.1145/3209978.3210043

work page doi:10.1145/3209978.3210043 2018
[74]

2001.Doing Science: Design, Analysis and Communication of Scientific Research(1 ed.)

Ivan Valiela. 2001.Doing Science: Design, Analysis and Communication of Scientific Research(1 ed.). Oxford University Press. doi:10.1093/oso/9780195079623.001. 0001

work page doi:10.1093/oso/9780195079623.001 2001
[75]

van Rijsbergen

Cornelis J. van Rijsbergen. 1979.Information Retrieval. Butterworths. doi:10. 1145/511829.511831

work page arXiv 1979
[76]

Voorhees and Chris Buckley

Ellen M. Voorhees and Chris Buckley. 2002. The Effect of Topic Set Size on Retrieval Experiment Error. InInternational ACM SIGIR Conference on Research and Development in Information Retrieval. 316–323. doi:10.1145/564376.564432

work page doi:10.1145/564376.564432 2002
[77]

Voorhees, Daniel Samarov, and Ian Soboroff

Ellen M. Voorhees, Daniel Samarov, and Ian Soboroff. 2017. Using Replicates in Information Retrieval Evaluation.ACM Transactions on Information Systems36, 2 (2017). doi:10.1145/3086701

work page doi:10.1145/3086701 2017
[78]

2006.All of Nonparametric Statistics(1 ed.)

Larry Wasserman. 2006.All of Nonparametric Statistics(1 ed.). Springer

2006
[79]

2003.Exact Statistical Methods for Data Analysis(1 ed.)

Samaradasa Weerahandi. 2003.Exact Statistical Methods for Data Analysis(1 ed.). Springer

2003
[80]

W. John Wilbur. 1994. Non-parametric Significance Tests of Retrieval Performance Comparisons. Journal of Information Science 20, 4 (1994), 270–284. doi:10.1177/016555159402000405

Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research. SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia