pith. machine review for the scientific record.

arxiv: 2605.05725 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series anomaly detection · multi-agent LLM · specialized analyzers · univariate time series · evidence consolidation · interpretability · synthetic in-context examples

The pith

A multi-agent framework breaks time series anomaly detection into four specialized families to raise both accuracy and interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a single general-purpose language model is not enough for reliable univariate time series anomaly detection on complex patterns. Instead, it decomposes the task into four distinct families—point, structural, seasonal, and pattern anomalies—each handled by its own analyzer that combines numerical tools with diagnostic visualizations to produce evidence. An evidence-grounded detector then merges these pieces into scored anomaly intervals and types using only synthetic examples drawn from normal data segments, after which a supervisor turns the records into readable diagnostic reports. The authors report that this structure delivers the highest average performance across three benchmarks when measured against strong machine-learning, deep-learning, and other language-model baselines, while human evaluation and ablations indicate gains in both detection reliability and the practical value of the outputs.
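A concrete instance of the "numerical tools" each analyzer combines with its visualizations: the paper's point analyzer scopes global point outliers to a z-score rule (|Z| ≥ 3). A minimal stdlib sketch of such a tool; the implementation details here are illustrative, not the paper's code:

```python
from statistics import fmean, pstdev

def zscore_point_candidates(series, threshold=3.0):
    """Flag indices whose global z-score magnitude reaches the threshold.

    Sketch of one numerical tool a point analyzer could run; the |Z| >= 3
    default follows the scope stated for the paper's point analyzer, but
    everything else here is an assumed implementation."""
    mu = fmean(series)
    sigma = pstdev(series)
    if sigma == 0:  # constant series: no point can be a global outlier
        return []
    return [i for i, v in enumerate(series) if abs((v - mu) / sigma) >= threshold]
```

In the full framework this would be only one evidence source among several; rolling-window (contextual) statistics and visual checks would sit alongside it.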

Core claim

SAGE decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE constructs synthetic in-context examples from normal-reference training segments without using real anomalous segments or anomaly-type labels.
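The Analyzer → Detector → Supervisor dataflow in the core claim can be sketched as a plain function pipeline. All names, record fields, and the callable interfaces below are illustrative assumptions, not the paper's API; only the shape of the evidence flow is taken from the text:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    family: str      # "point" | "structural" | "seasonal" | "pattern"
    interval: tuple  # (start, end) index pair
    score: float     # analyzer-local evidence strength in [0, 1]

@dataclass
class AnomalyRecord:
    interval: tuple
    confidence: float
    candidate_types: list

def run_pipeline(series, analyzers, detector, supervisor):
    """Illustrative SAGE-style dataflow: each analyzer emits evidence for
    its own anomaly family, the detector consolidates the pooled evidence
    into confidence-scored records, and the supervisor renders a report."""
    evidence = [ev for analyze in analyzers for ev in analyze(series)]
    records = detector(evidence)
    return supervisor(records)
```

In the paper, the analyzers, Detector, and Supervisor are LLM agents with tool access; here they would be plain callables consuming and producing these records.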

What carries the argument

Four-family analyzer decomposition together with evidence consolidation drawn from synthetic normal-only examples.
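The synthetic normal-only construction just named can be pictured as: take a normal reference segment, inject a synthetic anomaly, and keep the injection site as the label, so no real anomalies or type labels are ever required. A minimal sketch; the single additive spike and its magnitude rule are assumed stand-ins for the paper's nine-type injection rules:

```python
import random

def make_synthetic_icl_example(normal_segment, magnitude=5.0, seed=0):
    """Build a labeled in-context example from a normal reference segment
    by injecting a synthetic point anomaly (a single additive spike).

    Hypothetical simplification: the actual framework covers nine anomaly
    types with their own injection rules, not only this spike."""
    rng = random.Random(seed)
    segment = list(normal_segment)
    idx = rng.randrange(len(segment))
    scale = max(1.0, max(abs(v) for v in segment))  # avoid a zero-height spike
    segment[idx] += magnitude * scale
    return {"series": segment, "anomaly_indices": [idx], "anomaly_type": "point"}
```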

If this is right

  • Detection becomes more controllable because each analyzer can be inspected or tuned independently for its anomaly family.
  • Interpretability increases through explicit intervals, confidence scores, and candidate types rather than opaque single-model outputs.
  • Training data requirements are reduced since only normal segments are needed to create in-context examples.
  • Diagnostic reports become directly usable by human analysts, as shown by the human evaluation results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-family split could be applied to related sequential tasks such as change-point detection or regime identification.
  • Because the method avoids real anomaly labels for its examples, it may transfer more readily to domains where anomalies are scarce or costly to label.
  • Replacing the language-model components with lighter specialized models inside each analyzer might preserve performance while lowering compute cost.

Load-bearing premise

The four-family decomposition plus evidence consolidation from synthetic normal-only examples will reliably improve detection and interpretability for complex real-world anomaly patterns without introducing new failure modes.

What would settle it

A controlled test set containing anomalies that cross the four family boundaries or that differ markedly from the normal-reference segments used to build synthetic examples, evaluated to check whether SAGE still outperforms the same baselines.
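One way to build such boundary-crossing cases is to inject two anomaly families into the same interval, for example a level shift (structural family) overlaid with a frequency change (seasonal family). A hypothetical sketch; the parameters and the injection recipe are assumptions, not drawn from the paper:

```python
import math

def inject_hybrid_anomaly(series, start, length, level_shift=3.0, period=10):
    """Inject an anomaly spanning two families at once: a level shift
    combined with a doubled-frequency oscillation inside one interval,
    so that no single family-specific analyzer fully covers it."""
    out = list(series)
    for i in range(start, min(start + length, len(out))):
        # level_shift is structural; halving the period doubles the frequency
        out[i] += level_shift + math.sin(2 * math.pi * i / (period / 2))
    return out
```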

Figures

Figures reproduced from arXiv: 2605.05725 by Hyeongwon Kang, Jeongseob Kim, Jinwoo Park, Pilsung Kang.

Figure 1: Overview of SAGE. The input time series is converted into dual representations and …
Figure 2: Synthetic ICL pipeline in SAGE. Normal subsequences are converted into synthetic …
Figure 3: Point-F1 under the full SAGE system and each ab…
Figure 5: Representative synthetic examples of the nine anomaly types. In each panel, the red …
Figure 6: Example interface used for the method-blind human evaluation. Evaluators were shown …
Figure 7: Full SAGE pipeline example on Yahoo A1 real_22.
Figure 8: Representative case studies showing anomaly prediction and analyst-facing diagnosis. The …
Original abstract

Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert-like Detection), a multi-agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE further constructs synthetic in-context examples from normal-reference training segments, without using real anomalous segments or anomaly-type labels as in-context examples. Across three benchmarks, SAGE achieves the best average performance among strong ML/DL and language-model-based baselines. Ablation studies and human evaluation further show that the proposed framework improves detection reliability and the practical usefulness of diagnostic outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SAGE, a multi-agent LLM framework for univariate time series anomaly detection. It decomposes anomaly analysis into four specialized Analyzers (point, structural, seasonal, pattern) that apply family-specific numerical tools and diagnostic visualizations to generate evidence. An evidence-grounded Detector consolidates this evidence—using only synthetic in-context examples constructed from normal-reference training segments, without real anomalous segments or labels—into confidence-scored anomaly records with intervals and candidate types. A Supervisor converts these into analyst-facing diagnostic reports. The paper claims that SAGE achieves the best average performance across three benchmarks relative to strong ML/DL and language-model baselines, with ablation studies and human evaluation supporting improved detection reliability and practical usefulness of outputs.

Significance. If the performance and generalization claims hold, the work would be significant for improving controllability and interpretability in LLM-based time series anomaly detection. The four-family decomposition with specialized numerical tools and visualizations, combined with the use of synthetic normal-only in-context examples, represents a structured alternative to monolithic models and avoids label leakage. This approach could enhance reliability for complex patterns and practical analyst utility, as evidenced by the reported ablations and human evaluations. The framework's emphasis on evidence consolidation from normal references is a clear strength for reproducibility and avoiding circularity in performance reporting.

major comments (2)
  1. [Abstract] The central claim of achieving the best average performance across three benchmarks is presented without any quantitative metrics, error bars, statistical significance tests, or even the specific evaluation measures used. This absence is load-bearing because the superiority over ML/DL and LM baselines cannot be assessed for practical magnitude or robustness without these details.
  2. [§3 (Framework) and §4 (Experiments)] The four-family decomposition plus evidence consolidation from synthetic normal-only examples is positioned as reliably improving detection for complex real-world patterns, yet no analysis is provided of coverage gaps between the four families and real anomaly complexity, or of potential new failure modes (e.g., false negatives from extrapolation against learned normals). This directly affects the generalization claim that underpins the benchmark superiority.
minor comments (1)
  1. [Abstract] The description of how the Supervisor converts structured records into diagnostic reports would benefit from an example output format to clarify the analyst-facing interpretability gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and completeness. We address each major comment below and will revise the manuscript to incorporate the suggested changes where feasible.

point-by-point responses
  1. Referee: [Abstract] The central claim of achieving the best average performance across three benchmarks is presented without any quantitative metrics, error bars, statistical significance tests, or even the specific evaluation measures used. This absence is load-bearing because the superiority over ML/DL and LM baselines cannot be assessed for practical magnitude or robustness without these details.

    Authors: We agree that the abstract would benefit from quantitative support for the performance claims. In the revised version, we will expand the abstract to include the primary evaluation metric (F1-score), the reported average performance values across the three benchmarks, standard deviations from repeated runs where applicable, and a brief note on statistical comparisons to baselines. This will allow readers to immediately assess the magnitude and robustness of the improvements without needing to consult the full experimental section. revision: yes

  2. Referee: [§3 (Framework) and §4 (Experiments)] The four-family decomposition plus evidence consolidation from synthetic normal-only examples is positioned as reliably improving detection for complex real-world patterns, yet no analysis is provided of coverage gaps between the four families and real anomaly complexity, or of potential new failure modes (e.g., false negatives from extrapolation against learned normals). This directly affects the generalization claim that underpins the benchmark superiority.

    Authors: This is a fair critique. While the ablation studies in §4 quantify the contribution of each analyzer and the evidence-grounded detector, we did not provide an explicit analysis of coverage gaps across the four anomaly families or potential failure modes such as false negatives arising from reliance on normal-reference extrapolation. In the revision, we will add a dedicated limitations subsection in §4 that discusses the scope of the point/structural/seasonal/pattern decomposition, identifies categories of complex or hybrid anomalies that may fall outside these families based on benchmark observations, and analyzes how the supervisor and synthetic normal-only examples influence false-negative risks. We will also include qualitative examples drawn from the existing results to illustrate these points. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks using normal-only synthetic examples

full rationale

The paper describes a multi-agent LLM framework (SAGE) that decomposes time-series anomaly detection into four specialized analyzers (point, structural, seasonal, pattern) which apply numerical tools and visualizations to generate evidence, consolidated by a Detector and Supervisor. Synthetic in-context examples are built exclusively from normal-reference training segments, with no real anomalous segments or labels used. Performance is measured as average results across three standard external benchmarks against ML/DL and LLM baselines. No equations, parameter fits, self-definitional loops, or load-bearing self-citations appear in the derivation; the central claims rest on independent benchmark comparisons rather than reducing to author-defined quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on the domain assumptions that time-series anomalies naturally fall into four disjoint families and that LLM agents can reliably apply family-specific numerical tools without further training; no free parameters are introduced, and the three invented entities are architectural components of the framework rather than new empirical posits.

axioms (2)
  • domain assumption Time-series anomalies can be partitioned into point, structural, seasonal, and pattern categories with distinct diagnostic signatures.
    Invoked when the paper decomposes analysis into four specialized Analyzers.
  • domain assumption Synthetic in-context examples built only from normal segments suffice for reliable LLM behavior on anomalous test data.
    Stated explicitly as the data-construction rule that avoids real anomalous segments or labels.
invented entities (3)
  • Point Analyzer, Structural Analyzer, Seasonal Analyzer, Pattern Analyzer no independent evidence
    purpose: Each applies family-specific numerical tools and visualizations to produce evidence for its anomaly type.
    New components introduced by the framework; no independent falsifiable prediction outside the paper is given.
  • Evidence-grounded Detector no independent evidence
    purpose: Consolidates analyzer outputs into confidence-scored anomaly records with intervals and candidate types.
    New consolidation module; no external validation mentioned.
  • Supervisor no independent evidence
    purpose: Converts structured records into analyst-facing diagnostic reports.
    New reporting layer; no external validation mentioned.

pith-pipeline@v0.9.0 · 5499 in / 1595 out tokens · 44877 ms · 2026-05-08T11:39:56.509495+00:00 · methodology

discussion (0)

