Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

Alexey Zaytsev; Rostislav Gusev

arxiv: 2606.27997 · v1 · pith:VTXKBM2Tnew · submitted 2026-06-26 · 💻 cs.LG · stat.ML


Pith reviewed 2026-06-29 05:09 UTC · model grok-4.3


keywords: dataset selection · model ranking preservation · benchmark efficiency · time series classification · farthest-first selection · Spearman correlation · bootstrap aggregation


The pith

Small carefully chosen subsets of datasets can preserve model performance rankings nearly as well as full benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework to select small numbers of datasets from large benchmarks such that the relative rankings of models on those subsets closely match the rankings obtained from the entire collection. This matters because full benchmarks with dozens or hundreds of datasets make evaluation slow and expensive, so reliable smaller subsets would let researchers compare models more efficiently while keeping the same ordering of which models perform best. The authors compare multiple selection strategies including clustering, design optimality criteria, random selection, and greedy farthest-first against each other, using bootstrap aggregation to produce confidence intervals that allow statistical comparison of how well each strategy preserves rankings. In time series classification with 112 datasets, the strongest strategies reach a Spearman correlation of 0.95 with the full benchmark using only five selected datasets and outperform random selection, while gains are smaller or insignificant in natural language processing and recommender system benchmarks. Effectiveness of the methods depends on both the quality of the dataset representations and the overall scale of the benchmark.
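The quantity being preserved can be made concrete with a small sketch. It assumes models are ranked by their mean score over a set of datasets (the paper may aggregate differently, for example by mean rank) and that rank agreement is measured by Spearman correlation with average ranks for ties. The toy score matrix and all function names are illustrative, not taken from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def model_scores(scores, dataset_ids):
    """Mean score of each model over the chosen datasets (aggregation is an assumption)."""
    return [mean(row[d] for d in dataset_ids) for row in scores]

def rank_preservation(scores, subset):
    """Spearman correlation between the subset ranking and the full-benchmark ranking."""
    full = model_scores(scores, range(len(scores[0])))
    part = model_scores(scores, subset)
    return spearman(full, part)

# Toy data: 4 models (rows) x 6 datasets (columns), accuracy-like scores.
scores = [
    [0.90, 0.80, 0.85, 0.70, 0.95, 0.60],
    [0.85, 0.82, 0.80, 0.75, 0.90, 0.65],
    [0.70, 0.60, 0.75, 0.80, 0.70, 0.72],
    [0.60, 0.55, 0.65, 0.60, 0.65, 0.50],
]
good = rank_preservation(scores, subset=[0, 1])  # same model order as the full set
poor = rank_preservation(scores, subset=[3, 5])  # reshuffles the top three models
```

On this toy matrix the subset `[0, 1]` reproduces the full ordering exactly, while `[3, 5]` does not, which is the kind of gap a selection strategy is meant to avoid.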

Core claim

We introduce a framework for selecting dataset subsets that preserve global model rankings, incorporating bootstrap aggregation for confidence intervals and deriving upper bounds on ranking errors for farthest-first selection. Empirically, several strategies outperform random selection in preserving rankings, with the best achieving 0.95 Spearman correlation on time series classification using five datasets.

What carries the argument

The dataset selection strategies, particularly greedy farthest-first and clustering, evaluated by their effect on Spearman rank correlation with full-benchmark model rankings.
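As a rough illustration of greedy farthest-first selection, the sketch below runs the classic farthest-first traversal on dataset representation vectors. The Euclidean metric, the fixed starting index, and the toy feature vectors are assumptions made for the example; the paper's own distance and initialization are not described on this page.

```python
import math

def farthest_first(points, k, start=0):
    """Greedy farthest-first traversal: return the indices of k selected points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is currently worst covered by the selection.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

def covering_radius(points, selected):
    """Largest distance from any point to its nearest selected point."""
    return max(min(math.dist(p, points[s]) for s in selected) for p in points)

# Hypothetical 2-D dataset representations (one tuple per dataset).
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 0.2), (0.0, 4.0), (2.5, 2.0)]
chosen = farthest_first(features, k=3)
radius = covering_radius(features, chosen)
```

Each step adds the dataset farthest from everything already chosen, so the selection spreads across the representation space and the covering radius shrinks as `k` grows.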

If this is right

Model evaluations on large benchmarks can be approximated with much smaller selected subsets while retaining high rank fidelity.
In time series classification, as few as five datasets may suffice for Spearman correlations around 0.95.
Bootstrap aggregation supplies valid confidence intervals that enable principled statistical comparison among selection strategies.
Upper bounds on ranking errors can be stated for farthest-first selection as a function of the number of chosen datasets.
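To show how bootstrap intervals can support a comparison between strategies, here is a minimal percentile-bootstrap sketch. It assumes each strategy has already produced a list of per-trial Spearman correlations and that the statistic of interest is their mean; the paper's actual resampling scheme is not specified on this page, and the numbers below are made-up toy values, not results from the paper.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each replicate.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-trial Spearman correlations for two hypothetical strategies.
strategy_a = [0.93, 0.95, 0.96, 0.94, 0.95, 0.92, 0.97, 0.94]
strategy_b = [0.80, 0.88, 0.75, 0.91, 0.83, 0.79, 0.86, 0.82]
ci_a = bootstrap_ci(strategy_a)
ci_b = bootstrap_ci(strategy_b)
# Non-overlapping intervals suggest a reliable difference between strategies.
separated = ci_a[0] > ci_b[1]
```

Whether percentile or BCa intervals are used, and whether they are adjusted for multiple comparisons, changes how much weight a non-overlap check like this can bear.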

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection logic could be tested on benchmarks outside the three domains examined here to check whether high rank preservation holds more generally.
If better dataset representations become available, the correlation achieved by the same number of selected datasets could rise further.
Applying the selected subsets to entirely new models that were not part of the original ranking would test whether the preservation extends beyond the models used to choose the subsets.

Load-bearing premise

The dataset representations used for clustering and selection must accurately capture the performance differences that determine model rankings across the full set.

What would settle it

Finding that the selected five datasets in the time series classification benchmark produce a Spearman correlation below 0.7 with the full set of 112 datasets would contradict the reported level of rank preservation.

Figures

Figures reproduced from arXiv: 2606.27997 by Alexey Zaytsev, Rostislav Gusev.

**Figure 1.** We study how to select a small subset of datasets that best preserves the global model ranking of a large benchmark.

**Figure 2.** Rank preservation under dataset subset selection.

Original abstract

Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets typically relies on heuristics and is rarely analyzed for the robustness of the resulting model rankings. We introduce a framework for selecting dataset subsets, with an evaluation of how different selection strategies preserve the global model rankings. Our framework includes bootstrap aggregation, which provides valid confidence intervals, allowing a principled comparison of selection strategies. We consider clustering, design criteria (A/D-optimality), random baselines, and greedy farthest-first (FAFI). For the latter, we derive upper bounds on selection quality in terms of ranking errors as a function of the number of selected datasets. Empirically, in time series classification (TSC, 112 datasets) and in a supplementary natural language processing benchmark derived from MTEB (57 tasks), several selection strategies improve rank preservation compared with random subsets, including simple FAFI. In contrast, in recommender systems (30 datasets), the improvement of strategies over random selection is small and typically statistically insignificant. For TSC, our best-performing strategy achieves a Spearman correlation of 0.95 with the full benchmark model rankings using only five selected datasets. Additional experiments indicate that the effectiveness of selection approaches depends on both the quality of dataset representations and the scale of the benchmarking regime.
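For readers unfamiliar with the design criteria named in the abstract, the standard definitions from optimal experimental design can be written as below. The notation is an assumption made here for illustration: $X_S$ denotes the matrix whose rows are the representation vectors of the selected datasets $S$, and $k$ is the subset size. How the paper instantiates these criteria is not stated on this page.

```latex
\[
S_D^\star = \arg\max_{|S| = k} \det\!\left(X_S^\top X_S\right),
\qquad
S_A^\star = \arg\min_{|S| = k} \operatorname{tr}\!\left[\left(X_S^\top X_S\right)^{-1}\right].
\]
```

D-optimality favors subsets whose representations span a large volume, while A-optimality penalizes directions in which the selected representations carry little information.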

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that selection strategies such as FAFI can reach a Spearman rank correlation of 0.95 with just five datasets in TSC while beating random selection, but the gains shrink to statistical insignificance in recommender systems.


The main thing to know is that in time series classification with 112 datasets, strategies such as farthest-first selection reach Spearman correlations around 0.95 with the full ranking using only five datasets, and they outperform random subsets with bootstrap confidence intervals backing the comparison. The same pattern holds to a lesser degree on the MTEB-derived NLP tasks, but in the recommender systems benchmark with 30 datasets the edge over random is small and usually not statistically significant.
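The random baseline that the strategies are measured against can be estimated by Monte Carlo. The sketch below is an illustration under stated assumptions: models are ranked by mean score, the benchmark is synthetic (latent skill plus Gaussian noise), and Spearman correlation uses the closed form that is valid only when there are no ties. None of the names or numbers come from the paper.

```python
import random
from statistics import mean

def ranks(values):
    """1-based ranks; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out

def spearman_no_ties(x, y):
    """Closed form 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def random_baseline(scores, k, n_trials=500, seed=0):
    """Mean Spearman correlation with the full ranking over random size-k subsets."""
    rng = random.Random(seed)
    n_datasets = len(scores[0])
    full = [mean(row) for row in scores]
    rhos = []
    for _ in range(n_trials):
        subset = rng.sample(range(n_datasets), k)
        part = [mean(row[d] for d in subset) for row in scores]
        rhos.append(spearman_no_ties(full, part))
    return mean(rhos)

# Synthetic benchmark: 8 models x 30 datasets, score = latent skill + noise.
gen = random.Random(42)
skills = [gen.random() for _ in range(8)]
scores = [[s + gen.gauss(0, 0.3) for _ in range(30)] for s in skills]
baseline_5 = random_baseline(scores, k=5)
baseline_all = random_baseline(scores, k=30, n_trials=5)  # full set: correlation 1
```

A selection strategy is only interesting if its correlation at a given `k` sits clearly above the corresponding `random_baseline` value.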

What the work actually adds is a framework that applies bootstrap aggregation to give valid intervals for ranking preservation, plus derived upper bounds on FAFI selection quality in terms of ranking errors. They compare clustering, A/D-optimality, FAFI, and random baselines across three domains and explicitly flag that results depend on the quality of the dataset representations used for selection.

The bootstrap step and the domain-specific empirical results are the useful parts; they give a principled way to judge whether a selection method is reliably better than chance. The bounds on FAFI are a modest theoretical touch.

The soft spots are that the reported gains are clearly benchmark-dependent, with little practical improvement in one of the three cases, and the whole approach rests on how well the chosen representations capture the performance differences that actually drive model rankings. If those representations miss key factors, the selected subsets will not preserve rankings even if the math checks out. The abstract leaves the exact construction of representations and any post-selection checks somewhat opaque, though the stress-test note finds no internal contradictions.

This is for researchers who run large-scale benchmarks and want cheaper evaluation subsets without losing the ability to rank models. It is worth a serious referee because the framework is reproducible in principle, the comparisons are grounded, and the domain variation is reported honestly rather than hidden.

Referee Report

0 major / 3 minor

Summary. The paper introduces a framework for selecting small, representative subsets of datasets from large ML benchmarks such that model rankings (measured by Spearman correlation) are preserved relative to the full set. The framework incorporates bootstrap aggregation to produce valid confidence intervals for comparing selection strategies, including clustering, A/D-optimality criteria, greedy farthest-first (FAFI), and random baselines. Upper bounds on ranking errors are derived for FAFI as a function of the number of selected datasets. Empirical evaluation on time series classification (112 datasets), an MTEB-derived NLP benchmark (57 tasks), and recommender systems (30 datasets) shows that several non-random strategies outperform random selection, with the best TSC result reaching 0.95 Spearman correlation using only five datasets; gains are smaller and often insignificant in recommender systems. The paper explicitly states that effectiveness depends on the quality of dataset representations and benchmark scale.

Significance. If the empirical results and bounds hold under the stated dependence on representations, the work supplies a statistically grounded method for reducing the cost of model evaluation while preserving reliable rankings. The bootstrap procedure for independent confidence intervals, the explicit multi-domain evaluation, and the derivation of FAFI bounds are concrete strengths that distinguish this from purely heuristic subset selection. The acknowledgment that performance hinges on representation quality avoids overclaiming generality and provides a clear direction for future refinement.

minor comments (3)

[Abstract and §3 (framework)] The description of how dataset representations are obtained for clustering and FAFI is referenced but not detailed enough to assess whether they capture the performance differences that drive rankings; a short paragraph or table listing the representation features used would strengthen reproducibility.
[§4 (experiments)] The exact procedure for constructing the MTEB-derived 57-task benchmark (task filtering, representation construction) is mentioned only in passing; adding a brief appendix table or paragraph would allow readers to verify that the reported NLP results are not sensitive to that construction.
[Figures and tables] Figure captions and Table 1: axis labels and legend entries for the bootstrap confidence intervals should explicitly state whether intervals are percentile or BCa and whether they are adjusted for multiple comparisons across strategies.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their constructive summary of the paper and for recommending minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces an empirical framework for selecting dataset subsets and evaluates strategies (clustering, A/D-optimality, FAFI, random) by their ability to preserve model rankings via Spearman correlation, using bootstrap aggregation for independent confidence intervals. Upper bounds on FAFI selection quality are derived as a function of the number of datasets. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citations; comparisons rely on external random baselines and bootstrap resampling rather than internal fitting to the target ranking metric. The central claims are self-contained empirical demonstrations whose validity does not presuppose the reported improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical assumptions for bootstrap and the utility of existing selection algorithms; no free parameters fitted to the target ranking metric or new entities are described in the abstract.

axioms (1)

[standard math] Bootstrap aggregation yields valid confidence intervals for the ranking preservation metrics.
Invoked to enable principled comparison of selection strategies.

pith-pipeline@v0.9.1-grok · 5786 in / 1220 out tokens · 29795 ms · 2026-06-29T05:09:47.389083+00:00 · methodology


