pith. machine review for the scientific record.

arxiv: 2605.05725 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords time series anomaly detection · multi-agent LLM · specialized analyzers · univariate time series · evidence consolidation · interpretability · synthetic in-context examples

The pith

A multi-agent framework breaks time series anomaly detection into four specialized families to raise both accuracy and interpretability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a single general-purpose language model is not enough for reliable univariate time series anomaly detection on complex patterns. Instead, it decomposes the task into four distinct families—point, structural, seasonal, and pattern anomalies—each handled by its own analyzer that combines numerical tools with diagnostic visualizations to produce evidence. An evidence-grounded detector then merges these pieces into scored anomaly intervals and types using only synthetic examples drawn from normal data segments, after which a supervisor turns the records into readable diagnostic reports. The authors report that this structure delivers the highest average performance across three benchmarks when measured against strong machine-learning, deep-learning, and other language-model baselines, while human evaluation and ablations indicate gains in both detection reliability and the practical value of the outputs.
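A concrete instance of the "numerical tools" each analyzer combines with its visualizations: the paper's point analyzer scopes global point outliers to a z-score rule (|Z| ≥ 3). A minimal stdlib sketch of such a tool; the implementation details here are illustrative, not the paper's code:

```python
from statistics import fmean, pstdev

def zscore_point_candidates(series, threshold=3.0):
    """Flag indices whose global z-score magnitude reaches the threshold.

    Sketch of one numerical tool a point analyzer could run; the |Z| >= 3
    default follows the scope stated for the paper's point analyzer, but
    everything else here is an assumed implementation."""
    mu = fmean(series)
    sigma = pstdev(series)
    if sigma == 0:  # constant series: no point can be a global outlier
        return []
    return [i for i, v in enumerate(series) if abs((v - mu) / sigma) >= threshold]
```

In the full framework this would be only one evidence source among several; rolling-window (contextual) statistics and visual checks would sit alongside it.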

Core claim

SAGE decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE constructs synthetic in-context examples from normal-reference training segments without using real anomalous segments or anomaly-type labels.
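The Analyzer → Detector → Supervisor dataflow in the core claim can be sketched as a plain function pipeline. All names, record fields, and the callable interfaces below are illustrative assumptions, not the paper's API; only the shape of the evidence flow is taken from the text:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    family: str      # "point" | "structural" | "seasonal" | "pattern"
    interval: tuple  # (start, end) index pair
    score: float     # analyzer-local evidence strength in [0, 1]

@dataclass
class AnomalyRecord:
    interval: tuple
    confidence: float
    candidate_types: list

def run_pipeline(series, analyzers, detector, supervisor):
    """Illustrative SAGE-style dataflow: each analyzer emits evidence for
    its own anomaly family, the detector consolidates the pooled evidence
    into confidence-scored records, and the supervisor renders a report."""
    evidence = [ev for analyze in analyzers for ev in analyze(series)]
    records = detector(evidence)
    return supervisor(records)
```

In the paper, the analyzers, Detector, and Supervisor are LLM agents with tool access; here they would be plain callables consuming and producing these records.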

What carries the argument

Four-family analyzer decomposition together with evidence consolidation drawn from synthetic normal-only examples.
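The synthetic normal-only construction just named can be pictured as: take a normal reference segment, inject a synthetic anomaly, and keep the injection site as the label, so no real anomalies or type labels are ever required. A minimal sketch; the single additive spike and its magnitude rule are assumed stand-ins for the paper's nine-type injection rules:

```python
import random

def make_synthetic_icl_example(normal_segment, magnitude=5.0, seed=0):
    """Build a labeled in-context example from a normal reference segment
    by injecting a synthetic point anomaly (a single additive spike).

    Hypothetical simplification: the actual framework covers nine anomaly
    types with their own injection rules, not only this spike."""
    rng = random.Random(seed)
    segment = list(normal_segment)
    idx = rng.randrange(len(segment))
    scale = max(1.0, max(abs(v) for v in segment))  # avoid a zero-height spike
    segment[idx] += magnitude * scale
    return {"series": segment, "anomaly_indices": [idx], "anomaly_type": "point"}
```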

If this is right

  • Detection becomes more controllable because each analyzer can be inspected or tuned independently for its anomaly family.
  • Interpretability increases through explicit intervals, confidence scores, and candidate types rather than opaque single-model outputs.
  • Training data requirements are reduced since only normal segments are needed to create in-context examples.
  • Diagnostic reports become directly usable by human analysts, as shown by the human evaluation results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-family split could be applied to related sequential tasks such as change-point detection or regime identification.
  • Because the method avoids real anomaly labels for its examples, it may transfer more readily to domains where anomalies are scarce or costly to label.
  • Replacing the language-model components with lighter specialized models inside each analyzer might preserve performance while lowering compute cost.

Load-bearing premise

The four-family decomposition plus evidence consolidation from synthetic normal-only examples will reliably improve detection and interpretability for complex real-world anomaly patterns without introducing new failure modes.

What would settle it

A controlled test set containing anomalies that cross the four family boundaries or that differ markedly from the normal-reference segments used to build synthetic examples, evaluated to check whether SAGE still outperforms the same baselines.
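One way to build such boundary-crossing cases is to inject two anomaly families into the same interval, for example a level shift (structural family) overlaid with a frequency change (seasonal family). A hypothetical sketch; the parameters and the injection recipe are assumptions, not drawn from the paper:

```python
import math

def inject_hybrid_anomaly(series, start, length, level_shift=3.0, period=10):
    """Inject an anomaly spanning two families at once: a level shift
    combined with a doubled-frequency oscillation inside one interval,
    so that no single family-specific analyzer fully covers it."""
    out = list(series)
    for i in range(start, min(start + length, len(out))):
        # level_shift is structural; halving the period doubles the frequency
        out[i] += level_shift + math.sin(2 * math.pi * i / (period / 2))
    return out
```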

Figures

Figures reproduced from arXiv: 2605.05725 by Hyeongwon Kang, Jeongseob Kim, Jinwoo Park, Pilsung Kang.

Figure 1: Overview of SAGE. The input time series is converted into dual representations and …
Figure 2: Synthetic ICL pipeline in SAGE. Normal subsequences are converted into synthetic …
Figure 3: Point-F1 under the full SAGE system and each ab…
Figure 5: Representative synthetic examples of the nine anomaly types. In each panel, the red …
Figure 6: Example interface used for the method-blind human evaluation. Evaluators were shown …
Figure 7: Full SAGE pipeline example on Yahoo A1 real_22.
Figure 8: Representative case studies showing anomaly prediction and analyst-facing diagnosis. The …
Original abstract

Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert-like Detection), a multi-agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialized Analyzers for point, structural, seasonal, and pattern anomalies. Each Analyzer applies family-specific numerical tools and diagnostic visualizations to generate evidence, while an evidence-grounded Detector consolidates the evidence into confidence-scored anomaly records with intervals and candidate types. A Supervisor then converts these structured records into analyst-facing diagnostic reports. SAGE further constructs synthetic in-context examples from normal-reference training segments, without using real anomalous segments or anomaly-type labels as in-context examples. Across three benchmarks, SAGE achieves the best average performance among strong ML/DL and language-model-based baselines. Ablation studies and human evaluation further show that the proposed framework improves detection reliability and the practical usefulness of diagnostic outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SAGE, a multi-agent LLM framework for univariate time series anomaly detection. It decomposes anomaly analysis into four specialized Analyzers (point, structural, seasonal, pattern) that apply family-specific numerical tools and diagnostic visualizations to generate evidence. An evidence-grounded Detector consolidates this evidence—using only synthetic in-context examples constructed from normal-reference training segments, without real anomalous segments or labels—into confidence-scored anomaly records with intervals and candidate types. A Supervisor converts these into analyst-facing diagnostic reports. The paper claims that SAGE achieves the best average performance across three benchmarks relative to strong ML/DL and language-model baselines, with ablation studies and human evaluation supporting improved detection reliability and practical usefulness of outputs.

Significance. If the performance and generalization claims hold, the work would be significant for improving controllability and interpretability in LLM-based time series anomaly detection. The four-family decomposition with specialized numerical tools and visualizations, combined with the use of synthetic normal-only in-context examples, represents a structured alternative to monolithic models and avoids label leakage. This approach could enhance reliability for complex patterns and practical analyst utility, as evidenced by the reported ablations and human evaluations. The framework's emphasis on evidence consolidation from normal references is a clear strength for reproducibility and avoiding circularity in performance reporting.

major comments (2)
  1. [Abstract] The central claim of achieving the best average performance across three benchmarks is presented without any quantitative metrics, error bars, statistical significance tests, or even the specific evaluation measures used. This absence is load-bearing because the superiority over ML/DL and LM baselines cannot be assessed for practical magnitude or robustness without these details.
  2. [§3 (Framework) and §4 (Experiments)] The four-family decomposition plus evidence consolidation from synthetic normal-only examples is positioned as reliably improving detection for complex real-world patterns, yet no analysis is provided of coverage gaps between the four families and real anomaly complexity, or of potential new failure modes (e.g., false negatives from extrapolation against learned normals). This directly affects the generalization claim that underpins the benchmark superiority.
minor comments (1)
  1. [Abstract] The description of how the Supervisor converts structured records into diagnostic reports would benefit from an example output format to clarify the analyst-facing interpretability gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and completeness. We address each major comment below and will revise the manuscript to incorporate the suggested changes where feasible.

point-by-point responses
  1. Referee: [Abstract] The central claim of achieving the best average performance across three benchmarks is presented without any quantitative metrics, error bars, statistical significance tests, or even the specific evaluation measures used. This absence is load-bearing because the superiority over ML/DL and LM baselines cannot be assessed for practical magnitude or robustness without these details.

    Authors: We agree that the abstract would benefit from quantitative support for the performance claims. In the revised version, we will expand the abstract to include the primary evaluation metric (F1-score), the reported average performance values across the three benchmarks, standard deviations from repeated runs where applicable, and a brief note on statistical comparisons to baselines. This will allow readers to immediately assess the magnitude and robustness of the improvements without needing to consult the full experimental section. revision: yes

  2. Referee: [§3 (Framework) and §4 (Experiments)] The four-family decomposition plus evidence consolidation from synthetic normal-only examples is positioned as reliably improving detection for complex real-world patterns, yet no analysis is provided of coverage gaps between the four families and real anomaly complexity, or of potential new failure modes (e.g., false negatives from extrapolation against learned normals). This directly affects the generalization claim that underpins the benchmark superiority.

    Authors: This is a fair critique. While the ablation studies in §4 quantify the contribution of each analyzer and the evidence-grounded detector, we did not provide an explicit analysis of coverage gaps across the four anomaly families or potential failure modes such as false negatives arising from reliance on normal-reference extrapolation. In the revision, we will add a dedicated limitations subsection in §4 that discusses the scope of the point/structural/seasonal/pattern decomposition, identifies categories of complex or hybrid anomalies that may fall outside these families based on benchmark observations, and analyzes how the supervisor and synthetic normal-only examples influence false-negative risks. We will also include qualitative examples drawn from the existing results to illustrate these points. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks using normal-only synthetic examples

full rationale

The paper describes a multi-agent LLM framework (SAGE) that decomposes time-series anomaly detection into four specialized analyzers (point, structural, seasonal, pattern) which apply numerical tools and visualizations to generate evidence, consolidated by a Detector and Supervisor. Synthetic in-context examples are built exclusively from normal-reference training segments, with no real anomalous segments or labels used. Performance is measured as average results across three standard external benchmarks against ML/DL and LLM baselines. No equations, parameter fits, self-definitional loops, or load-bearing self-citations appear in the derivation; the central claims rest on independent benchmark comparisons rather than reducing to author-defined quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The framework rests on the domain assumptions that time-series anomalies naturally fall into four disjoint families and that LLM agents can reliably apply family-specific numerical tools without further training; no free parameters are introduced, and the three invented entities are architectural components of the framework rather than new empirical posits.

axioms (2)
  • domain assumption Time-series anomalies can be partitioned into point, structural, seasonal, and pattern categories with distinct diagnostic signatures.
    Invoked when the paper decomposes analysis into four specialized Analyzers.
  • domain assumption Synthetic in-context examples built only from normal segments suffice for reliable LLM behavior on anomalous test data.
    Stated explicitly as the data-construction rule that avoids real anomalous segments or labels.
invented entities (3)
  • Point Analyzer, Structural Analyzer, Seasonal Analyzer, Pattern Analyzer no independent evidence
    purpose: Each applies family-specific numerical tools and visualizations to produce evidence for its anomaly type.
    New components introduced by the framework; no independent falsifiable prediction outside the paper is given.
  • Evidence-grounded Detector no independent evidence
    purpose: Consolidates analyzer outputs into confidence-scored anomaly records with intervals and candidate types.
    New consolidation module; no external validation mentioned.
  • Supervisor no independent evidence
    purpose: Converts structured records into analyst-facing diagnostic reports.
    New reporting layer; no external validation mentioned.

pith-pipeline@v0.9.0 · 5499 in / 1595 out tokens · 44877 ms · 2026-05-08T11:39:56.509495+00:00 · methodology

discussion (0)

