Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Anthony Bagnall; Matthew Middlehurst; Patrick Schäfer

arxiv: 2304.13029 · v3 · submitted 2023-04-25 · 💻 cs.LG

Pith reviewed 2026-05-24 08:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords: time series classification · UCR archive · algorithm comparison · Hydra+MultiROCKET · HIVE-COTEv2 · bake off · machine learning

The pith

Hydra+MultiROCKET and HIVE-COTEv2 significantly outperform other time series classification algorithms on the expanded UCR archive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper revisits a 2017 comparison of time series classification methods and updates it with newer algorithms, an extended taxonomy that adds convolution, feature-based, and deep learning categories, and 30 additional datasets. It evaluates representatives from each category against the prior best performers using the full UCR collection of 112 problems plus the new ones. The central finding is that two hybrid algorithms stand out as clearly superior on both the original and fresh problems. A reader would care because the work supplies an updated practical ranking that can guide which methods to try first on new time series tasks. The evaluation also tests whether recent developments have shifted the frontier beyond the 2017 leaders.

Core claim

On the UCR time series classification archive, now expanded to 112 datasets plus 30 new ones, the hybrid algorithms Hydra+MultiROCKET and HIVE-COTEv2 achieve significantly higher accuracy than representatives from distance, interval, shapelet, dictionary, convolution, feature-based, and deep learning categories, as well as outperforming the previous best-of-category methods from 2017.

What carries the argument

The bake-off evaluation that places each algorithm into a feature-extraction category and measures statistical superiority on the UCR archive.

If this is right

Convolution-based and feature-based algorithms improve their respective categories but remain behind the top hybrids.
Deep learning approaches are included in the comparison yet do not lead the rankings.
The addition of 30 new datasets provides a check on whether earlier conclusions hold for fresh problems.
The taxonomy is extended to three new categories to accommodate recent algorithmic directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future algorithm design may benefit from explicitly combining elements that Hydra+MultiROCKET and HIVE-COTEv2 use.
Practitioners facing new time series tasks could start with these two hybrids rather than older category leaders.
The results raise the question of whether similar hybrid constructions would dominate in related tasks such as time series forecasting or anomaly detection.

Load-bearing premise

Performance differences observed on the UCR archive datasets reflect real differences that would appear on other time series classification tasks.

What would settle it

A collection of time series datasets, collected independently of the UCR archive, on which neither Hydra+MultiROCKET nor HIVE-COTEv2 shows a statistically significant accuracy advantage over the previous leading methods.

Original abstract

In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature they extract from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off, the UCR archive has expanded to 112 datasets and there have been a large number of new algorithms proposed. We revisit the bake off, seeing how each of the proposed categories have advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This updates the 2017 TSC bake-off with new algorithms and 30 added datasets, and finds Hydra+MultiROCKET and HIVE-COTEv2 ahead on the expanded UCR collection.

The paper revisits the 2017 time series classification bake-off. It keeps the same category structure, adds three new ones for convolution, feature-based, and deep learning methods, expands the archive, and runs the recent algorithms against the prior category leaders. The headline result is that Hydra+MultiROCKET and HIVE-COTEv2 come out ahead on both the original 112 datasets and the 30 new ones, with significance claims attached to those comparisons. That is the concrete advance: a refreshed ranking on more data using the same evaluation style as the earlier study.

The work is useful because it maintains the taxonomy and reproducibility focus of the 2017 paper, so readers can see where each line of work has moved. The new datasets are presented as recently donated or reformatted, which at least widens the test bed.

The main limitation is that the paper gives no side-by-side statistics on how the 30 new sets differ from the earlier ones in length, class balance, noise, or domain coverage. Without that, the superiority finding stays conditional on the UCR collection's existing distribution rather than proving broader robustness. The abstract mentions systematic comparisons and significance tests, but the strength of those tests depends on details like multiple-comparison corrections that are not visible here.

This paper is for people who work on time series classification and want an updated picture of which approaches currently lead on standard benchmarks. It is the sort of empirical consolidation that keeps the subfield from drifting. It deserves peer review because the experimental scope is clear and the claims rest on reproducible comparisons rather than new theory.

Referee Report

2 major / 1 minor

Summary. The paper revisits the 2017 TSC 'bake off' by expanding the UCR archive from 85 to 112 datasets, adding 30 new datasets, extending the original five-category taxonomy to include convolution-based, feature-based, and deep learning approaches, and benchmarking recent algorithms against prior best-of-category methods. It concludes that Hydra+MultiROCKET and HIVE-COTEv2 perform significantly better than other approaches on both the expanded archive and the new datasets.

Significance. If the central empirical claims hold after addressing the noted issues, the work would serve as a useful updated reference for the TSC community, documenting progress in hybrid and convolutional methods over six years while supplying code and results to support reproducibility. This strengthens the field's empirical foundation without introducing new theoretical derivations.

major comments (2)

[Data and experimental setup] Introduction of the 30 new datasets (described as 'recently donated or reformatted'): the manuscript provides no quantitative comparison of their statistical properties (length distributions, class counts, imbalance ratios, or domain coverage) against the original 85, which is load-bearing for the claim that the two algorithms 'perform significantly better ... on both the current and new TSC problems' since the new sets may simply replicate the UCR profile.
[Results and statistical analysis] Results section on significance claims: the abstract and evaluation state that Hydra+MultiROCKET and HIVE-COTEv2 'perform significantly better' across 142 datasets and multiple algorithms, yet no details are given on the exact statistical tests, handling of multiple comparisons (across algorithms, categories, and datasets), or corrections such as Bonferroni or FDR, which directly affects the support for the superiority conclusion.
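
The corrections named in the second comment can be illustrated with a short sketch. This is a minimal, standard-library Python example of Bonferroni and Benjamini-Hochberg (FDR) adjustment, not the procedure used in the paper. The function names and the p-values are hypothetical and chosen only for illustration.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration only; these are not results from the paper.
raw = [0.001, 0.012, 0.030, 0.040]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

On these toy inputs and a 0.05 threshold, Bonferroni retains two of the four comparisons as significant while Benjamini-Hochberg retains all four, which is why the choice of correction matters for a superiority claim.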

minor comments (1)

[Taxonomy extension] The extended taxonomy is presented in text; a single summary table or diagram would improve clarity for readers tracking the shift from the original five categories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity and completeness while preserving the core empirical contributions.

Referee: [Data and experimental setup] Introduction of the 30 new datasets (described as 'recently donated or reformatted'): the manuscript provides no quantitative comparison of their statistical properties (length distributions, class counts, imbalance ratios, or domain coverage) against the original 85, which is load-bearing for the claim that the two algorithms 'perform significantly better ... on both the current and new TSC problems' since the new sets may simply replicate the UCR profile.

Authors: We agree that explicitly comparing the statistical properties of the 30 new datasets to the original 85 UCR datasets would strengthen the manuscript and better support the generalizability claim. The new datasets were selected to expand coverage (including recently donated and reformatted series from diverse domains), but the current version does not include a side-by-side quantitative summary. In the revision we will add a table (or subsection) reporting length distributions, number of classes, imbalance ratios, and domain categories for both collections, along with any notable differences. This addresses the concern directly without altering the experimental results.
Revision: yes

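
The kind of per-dataset summary promised in this response could be computed along the following lines. This is a minimal sketch in standard-library Python, assuming each dataset is available as a list of series plus a parallel list of class labels. The function `summarise_dataset`, its field names, and the toy data are hypothetical and do not come from the paper or the UCR archive.

```python
from collections import Counter
from statistics import median


def summarise_dataset(series, labels):
    """Return series count, median length, class count, and imbalance ratio."""
    class_counts = Counter(labels)
    return {
        "n_series": len(series),
        "median_length": median(len(s) for s in series),
        "n_classes": len(class_counts),
        # Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
    }


# Toy data for illustration only; not taken from the UCR archive.
toy_series = [[0.1, 0.4, 0.3], [0.2, 0.1, 0.0], [0.5, 0.6, 0.7], [0.9, 0.8, 0.7]]
toy_labels = ["a", "a", "a", "b"]
summary = summarise_dataset(toy_series, toy_labels)
```

Running the same function over every dataset in both collections would give the rows of a side-by-side table, from which the distributions of length, class count, and imbalance could be compared.
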
Referee: [Results and statistical analysis] Results section on significance claims: the abstract and evaluation state that Hydra+MultiROCKET and HIVE-COTEv2 'perform significantly better' across 142 datasets and multiple algorithms, yet no details are given on the exact statistical tests, handling of multiple comparisons (across algorithms, categories, and datasets), or corrections such as Bonferroni or FDR, which directly affects the support for the superiority conclusion.

Authors: The significance statements follow the methodology of the original 2017 bake-off and subsequent TSC literature: critical difference diagrams based on average ranks, with pairwise Wilcoxon signed-rank tests and Holm correction for multiple comparisons. However, we acknowledge that the manuscript text does not spell out these details or the exact correction procedure. We will add a concise subsection (or paragraph in the evaluation section) describing the statistical tests, the multiple-comparison correction applied, and the threshold used. This will make the support for the superiority claims fully transparent.
Revision: yes
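
Two of the ingredients named in this response, average ranks and Holm correction, can be sketched in a few lines. The example below is a minimal illustration in standard-library Python, not the authors' code. The Wilcoxon signed-rank test itself is not implemented here; the p-values passed to `holm_adjust` are placeholders that in practice would come from a routine such as SciPy's `scipy.stats.wilcoxon`. All function names and numbers are hypothetical.

```python
def average_ranks(accuracies):
    """Mean rank of each algorithm over datasets (rank 1 = most accurate).

    `accuracies` maps algorithm name -> list of accuracies, one per dataset.
    Tied algorithms share the average of the ranks they span.
    """
    names = list(accuracies)
    n_datasets = len(next(iter(accuracies.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        ordered = sorted(names, key=lambda name: -accuracies[name][d])
        pos = 0
        while pos < len(ordered):
            end = pos
            # Extend the group while the next algorithm has the same accuracy.
            while (end + 1 < len(ordered)
                   and accuracies[ordered[end + 1]][d] == accuracies[ordered[pos]][d]):
                end += 1
            shared_rank = (pos + 1 + end + 1) / 2
            for name in ordered[pos:end + 1]:
                totals[name] += shared_rank
            pos = end + 1
    return {name: totals[name] / n_datasets for name in names}


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        # The k-th smallest p-value is scaled by the number of tests still in play.
        running_max = max(running_max, min(1.0, (m - step) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Made-up accuracies and p-values for illustration; not results from the paper.
toy_acc = {"A": [0.90, 0.80, 0.70], "B": [0.85, 0.80, 0.75], "C": [0.60, 0.70, 0.65]}
ranks = average_ranks(toy_acc)
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

In a critical difference diagram the algorithms would be placed along an axis by their average rank, and groups whose pairwise adjusted p-values stay above the chosen threshold would be joined as not significantly different.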

Circularity Check

0 steps flagged

Empirical benchmarking study contains no derivation chain or fitted inputs that reduce to self-reference.

The paper performs an experimental comparison of TSC algorithms on the UCR archive (112 existing + 30 added datasets). Its central claims rest on reported accuracy, rank, and significance test outcomes obtained by executing published algorithms on these external collections. No equations, parameters fitted to a subset then re-predicted, self-definitional constructs, or load-bearing self-citations appear in the text. The taxonomy extension is descriptive rather than deductive, and the addition of new datasets is presented as an empirical expansion without any circular renaming or ansatz smuggling. The evaluation protocol is therefore self-contained against the stated benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the UCR archive as a benchmark and the appropriateness of standard classifier comparison protocols; no free parameters or invented entities are introduced by this paper itself.

axioms (1)

Domain assumption: standard statistical significance tests for algorithm comparison are valid and correctly applied without bias from multiple testing or dataset selection.
The paper uses these tests to declare superior performance of the two new algorithms.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We extend the taxonomy to include three new categories... convolution and feature based algorithms as well as deep learning approaches... two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Critical difference diagrams... pairwise one-sided Wilcoxon signed-rank tests... 30 resamples"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Soft-MSM: Differentiable Context-Aware Elastic Alignment for Time Series
cs.LG · 2026-04 · unverdicted · novelty 7.0

Soft-MSM is a smooth, gradient-enabled version of the context-aware MSM distance for time series alignment that outperforms Soft-DTW alternatives in clustering and nearest-centroid classification.

Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression
cs.LG · 2023-05 · conditional · novelty 6.0

FreshPRINCE and DrCIF, two new unsupervised feature-based regressors adapted from time series classification, significantly outperform other methods on an expanded archive of 63 TSER problems and are the only ones to ...