Bake off redux: a review and experimental evaluation of recent time series classification algorithms
Pith reviewed 2026-05-24 08:47 UTC · model grok-4.3
The pith
Hydra+MultiROCKET and HIVE-COTEv2 significantly outperform other time series classification algorithms on the expanded UCR archive.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the UCR time series classification archive, now expanded to 112 datasets plus 30 new ones, the hybrid algorithms Hydra+MultiROCKET and HIVE-COTEv2 achieve significantly higher accuracy than representatives from distance, interval, shapelet, dictionary, convolution, feature-based, and deep learning categories, as well as outperforming the previous best-of-category methods from 2017.
What carries the argument
The bake-off evaluation that places each algorithm into a feature-extraction category and measures statistical superiority on the UCR archive.
If this is right
- Convolution-based and feature-based algorithms improve their respective categories but remain behind the top hybrids.
- Deep learning approaches are included in the comparison yet do not lead the rankings.
- The addition of 30 new datasets provides a check on whether earlier conclusions hold for fresh problems.
- The taxonomy is extended to three new categories to accommodate recent algorithmic directions.
Where Pith is reading between the lines
- Future algorithm design may benefit from explicitly combining elements that Hydra+MultiROCKET and HIVE-COTEv2 use.
- Practitioners facing new time series tasks could start with these two hybrids rather than older category leaders.
- The results raise the question of whether similar hybrid constructions would dominate in related tasks such as time series forecasting or anomaly detection.
Load-bearing premise
Performance differences observed on the UCR archive datasets reflect real differences that would appear on other time series classification tasks.
What would settle it
A collection of time series datasets, collected independently of the UCR archive, on which neither Hydra+MultiROCKET nor HIVE-COTEv2 shows a statistically significant accuracy advantage over the previous leading methods.
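As a rough illustration of what such a check could look like, the sketch below runs an exact one-sided sign test on paired accuracies. This is a deliberately simpler stand-in for the Wilcoxon signed-rank procedure the authors describe, not the paper's method. The function name `one_sided_sign_test` and every accuracy value are hypothetical and chosen only for illustration.

```python
from math import comb


def one_sided_sign_test(acc_new, acc_old):
    """Exact one-sided sign test on paired per-dataset accuracies.

    Null hypothesis: acc_new is no more likely to win than to lose.
    Returns (wins, losses, p_value). Tied datasets are dropped.
    """
    wins = sum(a > b for a, b in zip(acc_new, acc_old))
    losses = sum(a < b for a, b in zip(acc_new, acc_old))
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    # P(X >= wins) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, losses, p_value


# Hypothetical accuracies on ten independently collected datasets.
hybrid = [0.91, 0.85, 0.78, 0.95, 0.66, 0.88, 0.73, 0.81, 0.90, 0.70]
baseline = [0.89, 0.80, 0.75, 0.93, 0.68, 0.84, 0.70, 0.79, 0.86, 0.69]

wins, losses, p = one_sided_sign_test(hybrid, baseline)
print(f"wins={wins} losses={losses} p={p:.4f}")
```

A p-value above the chosen threshold on an independent collection would count as evidence for the settling condition above. A sign test ignores the size of each difference, so a rank-based test would normally be preferred in practice.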
Original abstract
In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature they extract from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off, the UCR archive has expanded to 112 datasets and there have been a large number of new algorithms proposed. We revisit the bake off, seeing how each of the proposed categories have advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits the 2017 TSC 'bake off' by evaluating on the UCR archive, which has grown from 85 to 112 datasets, introducing 30 new datasets, extending the original five-category taxonomy to include convolution-based, feature-based, and deep learning approaches, and benchmarking recent algorithms against prior best-of-category methods. It concludes that Hydra+MultiROCKET and HIVE-COTEv2 perform significantly better than other approaches on both the expanded archive and the new datasets.
Significance. If the central empirical claims hold after addressing the noted issues, the work would serve as a useful updated reference for the TSC community, documenting progress in hybrid and convolutional methods over six years while supplying code and results to support reproducibility. This strengthens the field's empirical foundation without introducing new theoretical derivations.
major comments (2)
- [Data and experimental setup] Introduction of the 30 new datasets (described as 'recently donated or reformatted'): the manuscript provides no quantitative comparison of their statistical properties (length distributions, class counts, imbalance ratios, or domain coverage) against the original 85, which is load-bearing for the claim that the two algorithms 'perform significantly better ... on both the current and new TSC problems' since the new sets may simply replicate the UCR profile.
- [Results and statistical analysis] Results section on significance claims: the abstract and evaluation state that Hydra+MultiROCKET and HIVE-COTEv2 'perform significantly better' across 142 datasets and multiple algorithms, yet no details are given on the exact statistical tests, handling of multiple comparisons (across algorithms, categories, and datasets), or corrections such as Bonferroni or FDR, which directly affects the support for the superiority conclusion.
minor comments (1)
- [Taxonomy extension] The extended taxonomy is presented in text; a single summary table or diagram would improve clarity for readers tracking the shift from the original five categories.
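The dataset comparison requested in the first major comment could be produced with a few summary statistics per collection. The sketch below is one possible shape for it, under stated assumptions: each dataset is represented by a hypothetical dict holding its series length and training labels, the imbalance ratio is taken to be the majority class count divided by the minority class count, and all values are invented for illustration.

```python
from collections import Counter
from statistics import median


def profile(datasets):
    """Summarise a collection of datasets.

    Each dataset is a dict with keys 'length' (series length) and
    'labels' (list of class labels). This layout is an assumption
    made for this sketch.
    """
    lengths = [d["length"] for d in datasets]
    class_counts = []
    imbalance = []
    for d in datasets:
        counts = Counter(d["labels"])
        class_counts.append(len(counts))
        # Assumed definition: majority class size / minority class size.
        imbalance.append(max(counts.values()) / min(counts.values()))
    return {
        "n_datasets": len(datasets),
        "median_length": median(lengths),
        "median_classes": median(class_counts),
        "max_imbalance_ratio": max(imbalance),
    }


# Hypothetical stand-ins for the existing archive and the new datasets.
archive = [
    {"length": 100, "labels": [0, 0, 1, 1]},
    {"length": 300, "labels": [0, 1, 2, 0, 1, 2]},
    {"length": 500, "labels": [0, 0, 0, 1]},
]
new_sets = [
    {"length": 1000, "labels": [0, 1, 1, 1, 1]},
    {"length": 2000, "labels": [0, 1, 0, 1]},
]

archive_profile = profile(archive)
new_profile = profile(new_sets)
print(archive_profile)
print(new_profile)
```

Printing the two profiles side by side gives the kind of table the comment asks for. Domain coverage would need a separate categorical field that this sketch does not model.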
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and completeness while preserving the core empirical contributions.
Point-by-point responses
- Referee: [Data and experimental setup] Introduction of the 30 new datasets (described as 'recently donated or reformatted'): the manuscript provides no quantitative comparison of their statistical properties (length distributions, class counts, imbalance ratios, or domain coverage) against the original 85, which is load-bearing for the claim that the two algorithms 'perform significantly better ... on both the current and new TSC problems' since the new sets may simply replicate the UCR profile.
  Authors: We agree that explicitly comparing the statistical properties of the 30 new datasets to the original 85 UCR datasets would strengthen the manuscript and better support the generalizability claim. The new datasets were selected to expand coverage (including recently donated and reformatted series from diverse domains), but the current version does not include a side-by-side quantitative summary. In the revision we will add a table (or subsection) reporting length distributions, number of classes, imbalance ratios, and domain categories for both collections, along with any notable differences. This addresses the concern directly without altering the experimental results.
  Revision: yes
- Referee: [Results and statistical analysis] Results section on significance claims: the abstract and evaluation state that Hydra+MultiROCKET and HIVE-COTEv2 'perform significantly better' across 142 datasets and multiple algorithms, yet no details are given on the exact statistical tests, handling of multiple comparisons (across algorithms, categories, and datasets), or corrections such as Bonferroni or FDR, which directly affects the support for the superiority conclusion.
  Authors: The significance statements follow the methodology of the original 2017 bake-off and subsequent TSC literature: critical difference diagrams based on average ranks, with pairwise Wilcoxon signed-rank tests and Holm correction for multiple comparisons. However, we acknowledge that the manuscript text does not spell out these details or the exact correction procedure. We will add a concise subsection (or paragraph in the evaluation section) describing the statistical tests, the multiple-comparison correction applied, and the threshold used. This will make the support for the superiority claims fully transparent.
  Revision: yes
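The two ingredients named in the second response, average ranks and a Holm correction, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the accuracy matrix and the p-values are hypothetical, and in a real analysis the p-values would come from the pairwise Wilcoxon signed-rank tests (for example via `scipy.stats.wilcoxon`), which are not computed here.

```python
def average_ranks(acc):
    """Average rank per algorithm, where acc[d][a] is the accuracy of
    algorithm a on dataset d. Rank 1 is best; ties share the mean rank."""
    n_algs = len(acc[0])
    totals = [0.0] * n_algs
    for row in acc:
        for a, score in enumerate(row):
            better = sum(other > score for other in row)
            tied = sum(other == score for other in row)  # counts itself
            # Tied entries occupy ranks better+1 .. better+tied.
            totals[a] += better + (tied + 1) / 2
    return [t / len(acc) for t in totals]


def holm(p_values, alpha=0.05):
    """Holm step-down correction. Returns one reject flag per
    hypothesis, in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] > alpha / (m - step):
            break  # every remaining (larger) p-value is retained too
        reject[i] = True
    return reject


# Hypothetical accuracies: 3 datasets (rows) x 3 algorithms (columns).
acc = [
    [0.90, 0.88, 0.80],
    [0.85, 0.85, 0.70],
    [0.95, 0.90, 0.92],
]
ranks = average_ranks(acc)

# Hypothetical p-values for three pairwise comparisons.
decisions = holm([0.001, 0.04, 0.03], alpha=0.05)
print(ranks, decisions)
```

Holm is uniformly at least as powerful as a plain Bonferroni correction while still controlling the family-wise error rate, which is one reason it is a common choice when many algorithm pairs are compared.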
Circularity Check
Empirical benchmarking study contains no derivation chain or fitted inputs that reduce to self-reference.
full rationale
The paper performs an experimental comparison of TSC algorithms on the UCR archive (112 existing + 30 added datasets). Its central claims rest on reported accuracy, rank, and significance test outcomes obtained by executing published algorithms on these external collections. No equations, parameters fitted to a subset then re-predicted, self-definitional constructs, or load-bearing self-citations appear in the text. The taxonomy extension is descriptive rather than deductive, and the addition of new datasets is presented as an empirical expansion without any circular renaming or ansatz smuggling. The evaluation protocol is therefore self-contained against the stated benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard statistical significance tests for algorithm comparison are valid and correctly applied without bias from multiple testing or dataset selection.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We extend the taxonomy to include three new categories... convolution and feature based algorithms as well as deep learning approaches... two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Critical difference diagrams... pairwise one-sided Wilcoxon signed-rank tests... 30 resamples
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Soft-MSM: Differentiable Context-Aware Elastic Alignment for Time Series
  Soft-MSM is a smooth, gradient-enabled version of the context-aware MSM distance for time series alignment that outperforms Soft-DTW alternatives in clustering and nearest-centroid classification.
- Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression
  FreshPRINCE and DrCIF, two new unsupervised feature-based regressors adapted from time series classification, significantly outperform other methods on an expanded archive of 63 TSER problems and are the only ones to ...