pith. sign in

arxiv: 2401.00109 · v3 · submitted 2023-12-30 · 💻 cs.DS · cs.DB· cs.DC· cs.SC

Parallel Two-Stage Approach for Joint Symbolic Approximation of Time Series

Pith reviewed 2026-05-24 04:58 UTC · model grok-4.3

classification 💻 cs.DS cs.DBcs.DCcs.SC
keywords time seriessymbolic approximationparallel processingdigitizationscalabilitydata compressionABBA
0
0 comments X

The pith

A parallel two-stage approach decouples local compression from global digitization for scalable joint symbolic approximation of time series.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for symbolic approximation that works on many time series at once instead of handling them one after another. It partitions the series into independent parts that can be compressed in parallel and then uses a shared dictionary to assign symbols to all of them together. A two-stage process does local aggregation first and then merges globally to avoid moving all data at once. This is meant to keep the quality of the representation high while making it fast enough for large collections of data on modern computers. If it works, it means symbolic methods can be used as a practical tool for big temporal datasets without losing the shape-preserving benefits.

Core claim

The proposed method decouples local compression from global digitization by partitioning time series into independent domains for parallel compression and digitizing the resulting pieces using a shared global dictionary, with a two-stage parallel digitization scheme that performs aggregation locally before a global merge without full-data reassignment.

What carries the argument

Decoupling of local compression from global digitization with a shared global dictionary and two-stage parallel aggregation.

If this is right

  • Consistent symbols are produced across different time series.
  • Runtime is substantially reduced on multicore and distributed-memory systems.
  • Reconstruction quality remains competitive with sequential per-series methods.
  • The method can serve as an efficient high-level parallel tool for large-scale temporal data analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may allow pattern discovery across multiple series that independent processing cannot achieve.
  • It could be adapted to other symbolic representation techniques to improve their scalability.
  • In practice, the local aggregation step might lower communication costs in distributed computing environments.

Load-bearing premise

That partitioning time series into independent domains for parallel compression and using a shared global dictionary will preserve the shape-preserving and consistency properties without introducing significant quality loss.

What would settle it

Running the method on the paper's large synthetic benchmarks and finding that reconstruction quality falls substantially below the sequential ABBA baseline or that runtime savings are negligible.

Figures

Figures reproduced from arXiv: 2401.00109 by Xinye Chen.

Figure 2
Figure 2. Figure 2: Image segmentation with VQ and GA: The three [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: 2-dimensional data partition using vector quantiza [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: shows the simulation of k-means++ and sampling-based k-means (sampling 𝑟 = 10% of data) on Gaussian blobs data with 10 clusters, proceeding by increasing data sizes. The result is as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Value of 𝛼 as tol increases. 5 JOINT SYMBOLIC APPROXIMATION After discussing the two ABBA methods, we introduce a joint sym￾bolic aggregate approximation on how to perform fast ABBA sym￾bolization on multiple time series while retaining the symbolic consistency. This joint approximation framework is also applicable to large-scale univariate time series and multivariate time series. The ideal case of symbol… view at source ↗
Figure 5
Figure 5. Figure 5: Fork and join model: since there is no dependency [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Reconstruction of symbolic representation for [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: MSE of JABBA with varying number of partitions [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Runtime of JABBA with varying number of parti [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Speedup. introduce a joint symbolic approximation method that improves the speed of ABBA symbolization and achieves symbolic consistency in each representation. The framework of joint symbolic approxima￾tion enables parallel computing for further speedup. Attributed to the symbolic consistency, a manipulation of natural language pro￾cessing and text mining techniques is available in time series. The conver… view at source ↗
read the original abstract

As time-series applications grow larger, there is increasing demand for symbolic representations that are compact, accurate, and scalable across many signals and computing resources. Current ABBA-based symbolic approximation methods produce high-quality, shape-preserving representations, but they handle each time series separately and sequentially. This means they do not ensure consistent symbols across different series and cannot fully exploit modern multicore systems and distributed-memory systems. This paper presents a joint symbolic time-series approximation method for large-scale time series. The proposed method decouples local compression from global digitization: (i) time series are partitioned into independent domains that can be compressed in parallel, and (ii) the resulting pieces are digitized using a shared global dictionary. To further improve scalability, we introduce a two-stage parallel digitization scheme, in which aggregation is first performed locally and then merged globally without requiring a full-data reassignment step. Extensive experiments on time-series datasets and large synthetic benchmarks show that our approach maintains competitive reconstruction quality while substantially reducing runtime. These results show that joint symbolic approximation can serve as an efficient, high-level parallel tool for analyzing large-scale temporal data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a parallel two-stage joint symbolic approximation method for time series based on ABBA. It decouples local compression (partitioning series into independent domains for parallel processing) from global digitization (using a shared dictionary via local aggregation followed by global merge without full reassignment), claiming that this yields consistent symbols across series, competitive reconstruction quality, and substantially reduced runtime compared to sequential per-series ABBA, as supported by experiments on real datasets and large synthetic benchmarks.

Significance. If the quality preservation claim holds under the two-stage merge, the method would enable scalable, consistent symbolic representations for large-scale time-series analysis on multicore and distributed systems. The extensive experiments on synthetic benchmarks provide a concrete strength for assessing practical performance.

major comments (1)
  1. [§3.2] §3.2: The global merge step operates solely on aggregated local statistics without a full-data reassignment pass. When partition distributions differ, this can shift centroids or symbol boundaries, which directly risks violating the shape-preserving property guaranteed by sequential ABBA per series. No analytic bound is derived showing that the resulting merge error remains controlled by the local aggregation error; this is load-bearing for the central claim that quality remains competitive.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major comment regarding the global merge step in detail below.

read point-by-point responses
  1. Referee: [§3.2] §3.2: The global merge step operates solely on aggregated local statistics without a full-data reassignment pass. When partition distributions differ, this can shift centroids or symbol boundaries, which directly risks violating the shape-preserving property guaranteed by sequential ABBA per series. No analytic bound is derived showing that the resulting merge error remains controlled by the local aggregation error; this is load-bearing for the central claim that quality remains competitive.

    Authors: We acknowledge that the global merge relies on aggregated local statistics and that no analytic bound on the approximation error is provided in the manuscript. The two-stage approach is motivated by the need for scalability in large-scale settings where a full reassignment would be prohibitive. The aggregation step collects sufficient statistics to construct a global dictionary that approximates what a full pass would yield. While differing partition distributions could in principle affect centroids, our design uses a merge that combines the local dictionaries in a way that preserves the overall distribution. The central claim of competitive quality is supported by extensive experiments on real datasets and large synthetic benchmarks, where reconstruction errors are shown to be comparable to sequential ABBA even under varied partitioning. We agree that a theoretical bound would be valuable but note that deriving tight bounds for such merge operations in symbolic approximation is non-trivial; the empirical validation serves as the primary evidence for practicality. revision: partial

Circularity Check

0 steps flagged

No circularity: algorithmic proposal validated by external experiments

full rationale

The paper introduces a parallel two-stage method that partitions time series for local compression followed by global dictionary digitization via aggregation and merge. Claims rest on runtime reduction and competitive reconstruction quality shown via experiments on real and synthetic datasets. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the central claims to inputs by construction. The approach is presented as an engineering decoupling with empirical checks, not a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no specific details on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5722 in / 1029 out tokens · 32942 ms · 2026-05-24T04:58:38.880409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Ragha- van. 1998. Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the ACM SIGMOD International Conference on Management of Data 27 (1998), 94–105

  2. [2]

    2007.k-means++: the advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. 2007.k-means++: the advantages of careful seeding. In SODA ’07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027– 1035

  3. [3]

    Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn J

    Anthony J. Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn J. Keogh. 2018. The UEA multivariate time series classification archive. CoRR (2018)

  4. [4]

    Konstantinos Bountrogiannis, George Tzagkarakis, and Panagiotis Tsakalides

  5. [5]

    IEEE Transactions on Knowledge and Data Engineering 35, 6 (2023), 5752–5766

    Distribution Agnostic Symbolic Representations for Time Series Dimension- ality Reduction and Online Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering 35, 6 (2023), 5752–5766

  6. [6]

    Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density- Based Clustering Based on Hierarchical Density Estimates. InAdvances in Knowl- edge Discovery and Data Mining . Springer, 160–172

  7. [7]

    Xinye Chen and Stefan Güttel. 2022. An Efficient Aggregation Method for the Symbolic Representation of Temporal Data. ACM Transactions on Knowledge Discovery from Data (2022)

  8. [8]

    Xinye Chen and Stefan Güttel. 2022. Fast and explainable clustering based on sorting. (2022), 25. arXiv:2202.01456

  9. [9]

    Sanjoy Dasgupta and Yoav Freund. 2008. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (STOC ’08). ACM, 537–546

  10. [10]

    Sanjoy Dasgupta and Yoav Freund. 2009. Random Projection Trees for Vector Quantization. IEEE Transactions on Information Theory 55, 7 (2009), 3229–3242

  11. [11]

    Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh

  12. [12]

    IEEE/CAA Journal of Automatica Sinica 6, 6 (2019), 1293–1305

    The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6, 6 (2019), 1293–1305

  13. [13]

    Drineas, A

    P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. 2004. Clustering large graphs via the singular value decomposition. Machine Learning 56, 1–3 (2004), 9–33

  14. [14]

    Steven Elsworth and Stefan Güttel. 2020. ABBA: adaptive Brownian bridge-based symbolic aggregation of time series. Data Mining and Knowledge Discovery 34 (2020), 1175–1200

  15. [15]

    Steven Elsworth and Stefan Güttel. 2020. Time series forecasting using LSTM networks: A symbolic approach. (2020), 12. arXiv:2003.05672

  16. [16]

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density- Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226–231

  17. [17]

    Yifeng Gao and Jessica Lin. 2019. Discovering Subdimensional Motifs of Different Lengths in Large-Scale Multivariate Time Series. InIEEE International Conference on Data Mining. 220–229

  18. [18]

    Robert Gray. 1984. Vector quantization. IEEE ASSP Magazine 1, 2 (1984), 4–29

  19. [19]

    Keogh, J

    E. Keogh, J. Lin, and A. Fu. 2005. HOT SAX: efficiently finding the most un- usual time series subsequence. In IEEE International Conference on Data Mining (ICDM’05). 1–8

  20. [20]

    Eamonn Keogh, Stefano Lonardi, and Bill ’Yuan-chi’ Chiu. 2002. Finding Surpris- ing Patterns in a Time Series Database in Linear Time and Space. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’02). ACM, 550–556

  21. [21]

    Xiaosheng Li and Jessica Lin. 2017. Linear Time Complexity Time Series Classi- fication with Bag-of-Pattern-Features. In IEEE International Conference on Data Mining. 277–286

  22. [22]

    Xiaosheng Li, Jessica Lin, and Liang Zhao. 2021. Time Series Clustering in Linear Time Complexity. Data Mining and Knowledge Discovery 35, 6 (2021), 2369–2388

  23. [23]

    Yuan Li and Jessica Lin. 2010. Approximate Variable-Length Time Series Motif Discovery Using Grammar Inference. In Proceedings of the 10th International Workshop on Multimedia Data Mining (MDMKDD ’10). ACM, 9

  24. [24]

    Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 2–11

  25. [25]

    Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 2 (2007), 107–144

  26. [26]

    Lawrence Zitnick

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. European Conference on Computer Vision (2014), 740–755

  27. [27]

    Lkhagva, Yu Suzuki, and K

    B. Lkhagva, Yu Suzuki, and K. Kawagoe. 2006. New Time Series Data Represen- tation ESAX for Financial Applications. In 22nd International Conference on Data Engineering Workshops (ICDEW’06). x115–x115. https://doi.org/10.1109/ICDEW. 2006.99

  28. [28]

    S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Infor- mation Theory 28, 2 (1982), 129–137

  29. [29]

    Stuart P. Lloyd. 1982. Least squares quantization in PCM. Transactions on Information Theory 28 (1982), 129–137

  30. [30]

    Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. 2012. The planar k-means problem is NP-hard. Theoretical Computer Science 442 (2012), 13–21. Special Issue on the Workshop on Algorithms and Computation (WALCOM 2009)

  31. [31]

    Simon Malinowski, Thomas Guyet, René Quiniou, and Romain Tavenard. 2013. 1d-SAX: A Novel Symbolic Representation for Time Series. In Advances in Intel- ligent Data Analysis XII

  32. [32]

    Thach Le Nguyen and Georgiana Ifrim. 2023. Fast Time Series Classification with Random Symbolic Subsequences. InAdvanced Analytics and Learning on Temporal Data: 7th ECML PKDD Workshop, AALTD 2022, Grenoble, France, September 19–23, 2022, Revised Selected Papers . Springer, 50—-65

  33. [33]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour- napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

  34. [34]

    N. Ruta, N. Sawada, K. McKeough, M. Behrisch, and J. Beyer. 2019. SAX Nav- igator: Time Series Exploration through Hierarchical Clustering. In 2019 IEEE Visualization Conference. IEEE, 236–240

  35. [35]

    Boedi- hardjo, Crystal Chen, and Susan Frankenstein

    Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedi- hardjo, Crystal Chen, and Susan Frankenstein. 2015. Time series anomaly dis- covery with grammar-based compression.. In 18th International Conference on Extending Database Technology. OpenProceedings.org, 481–492

  36. [36]

    Pavel Senin and Sergey Malinchik. 2013. SAX-VSM: Interpretable Time Se- ries Classification Using SAX and Vector Space Model. In IEEE International Conference on Data Mining . 1175–1180

  37. [37]

    Yu and Jianbo Shi

    Stella X. Yu and Jianbo Shi. 2003. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision , Vol. 2. IEEE, 313

  38. [38]

    Yuncong Yu, Tim Becker, Le Minh Trinh, and Michael Behrisch. 2023. SAXRegEx: Multivariate time series pattern search with symbolic representation, regular expression, and query expansion. Computers & Graphics 112 (2023), 13–21

  39. [39]

    Shengdong Zhang, Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. 2017. Deep learning on symbolic representations for large-scale heterogeneous time-series event prediction. In2017 IEEE International Conference on Acoustics, Speech and Signal Processing . 5970–5974

  40. [40]

    Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data . ACM, 103–114