ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

Ameet Talwalkar; Ben Cohen; Chenghao Liu; David Asker; Emaad Khwaja; Junhong Shen; Mononito Goswami; Othmane Abou-Amal; Stephan Xie

arxiv: 2604.21199 · v2 · submitted 2026-04-23 · 💻 cs.LG · cs.CV


Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3


keywords time series question answering · benchmark · software incident response · vision language models · foundation models · anomaly reasoning · hybrid models · model expert combination


The pith

A benchmark of 750 questions on real software incident time series shows frontier vision-language models reaching 63 percent accuracy, a hybrid model matching them, and a best-of-two oracle over model and expert answers reaching 87 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to measure how well foundation models can answer natural language questions about time series data drawn from software incidents. It tests leading language, vision-language, and time series models and reports that vision-language models lead the pack, with the strongest one scoring 62.7 percent accuracy and 51.9 percent F1. The authors also build a hybrid system that combines a time series foundation model with a vision-language model and show that light additional training lets it reach similar overall scores. Finally, the work defines an oracle that picks the better answer from either the model or a human expert and records 87.2 percent accuracy, revealing clear complementary strengths between automated and human reasoning on the same data.
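The gap between 62.7 percent accuracy and 51.9 percent F1 is easier to read with the two metrics side by side. The sketch below is a minimal illustration, assuming F1 is macro-averaged over answer classes (the abstract does not say which averaging the authors use); the function names and the toy answers are hypothetical.

```python
def accuracy(pred, gold):
    """Fraction of questions answered exactly right."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def macro_f1(pred, gold):
    """Unweighted mean of per-class F1 over every label seen in pred or gold.

    Assumption: macro averaging. The source does not state the averaging used.
    """
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(pred, gold))
        fp = sum(p == label and g != label for p, g in zip(pred, gold))
        fn = sum(p != label and g == label for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): an imbalanced yes/no question set and a
# predictor that always answers "yes".
gold = ["yes", "yes", "yes", "yes", "no"]
pred = ["yes", "yes", "yes", "yes", "yes"]

acc = accuracy(pred, gold)
f1 = macro_f1(pred, gold)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

On this toy input the always-yes predictor scores 0.8 accuracy but only about 0.44 macro-F1, which is one way imbalanced answer classes can pull F1 below accuracy.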

Core claim

ARFBench consists of 750 questions over 142 time series drawn from 63 production incidents. When evaluated on this benchmark, frontier VLMs achieve a leading 62.7 percent accuracy and 51.9 percent F1 score. A novel hybrid prototype combining a time series foundation model with a vision-language model, after post-training on limited synthetic and real data, reaches comparable performance. A best-of-two oracle that selects between model and expert answers attains 82.8 percent F1 and 87.2 percent accuracy.

What carries the argument

ARFBench, the collection of 750 natural language questions paired with 142 time series from software incidents, serves as the evaluation suite; the TSFM plus VLM hybrid prototype provides the specialized multimodal approach; and the model-expert oracle selector demonstrates the upper bound by combining automated and human responses.
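A best-of-two oracle can be sketched in a few lines. This is an illustrative reading of the selector described above, not the authors' code: it assumes one model answer and one expert answer per question with exact-match scoring, and the names `oracle_answers`, `accuracy`, and the toy data are hypothetical.

```python
def accuracy(pred, gold):
    """Fraction of exact matches between predictions and gold answers."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def oracle_answers(model, expert, gold):
    """Per question, keep the model answer if it is correct, else the expert answer.

    The selector peeks at the gold label, so its score is an upper bound on
    any real system that must choose between the two without the answer key.
    """
    if not (len(model) == len(expert) == len(gold)):
        raise ValueError("all answer lists must have the same length")
    return [m if m == g else e for m, e, g in zip(model, expert, gold)]


# Toy multiple-choice answers (hypothetical).
gold = ["A", "B", "C", "A", "D"]
model = ["A", "C", "C", "B", "A"]
expert = ["B", "B", "C", "B", "D"]

best = oracle_answers(model, expert, gold)
model_acc = accuracy(model, gold)
expert_acc = accuracy(expert, gold)
oracle_acc = accuracy(best, gold)
print(model_acc, expert_acc, oracle_acc)
```

Here the model and the expert are right on different questions, so the oracle beats both; it still misses the one question that neither answers correctly.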

If this is right

Specialized post-training on time series plots and visual representations can bring smaller models up to the performance level of much larger general-purpose models for this task.
Human experts and current models each miss different kinds of patterns in incident data, so systems that route between them can reach substantially higher reliability than either alone.
Future model development for operations monitoring should target the performance gap between single models and the model-expert oracle rather than raw scale.
Production observability platforms could embed these models to accelerate root-cause analysis during live incidents.
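The routing idea in the second point can be illustrated with a deliberately simple per-category router. This is a sketch under stated assumptions, not something the paper reports building: it assumes a held-out validation set with per-question correctness for both the model and the expert, and the category names, the tie-breaking rule, and the function names are all hypothetical.

```python
from collections import defaultdict


def fit_router(validation):
    """Learn, per question category, whether the model or the expert is more reliable.

    validation: iterable of (category, model_correct, expert_correct) tuples.
    Assumption: ties go to the model, on the guess that a model call is
    cheaper than expert time.
    """
    tally = defaultdict(lambda: [0, 0])  # category -> [model hits, expert hits]
    for category, model_ok, expert_ok in validation:
        tally[category][0] += int(model_ok)
        tally[category][1] += int(expert_ok)
    return {c: ("model" if m >= e else "expert") for c, (m, e) in tally.items()}


def route(router, category, default="expert"):
    """Pick who answers a new question; unseen categories fall back to `default`."""
    return router.get(category, default)


# Toy validation records (hypothetical categories and outcomes).
validation = [
    ("anomaly_start", True, False),
    ("anomaly_start", True, False),
    ("root_cause", False, True),
    ("root_cause", True, True),
    ("root_cause", False, True),
]
router = fit_router(validation)
print(router)
```

Unlike the oracle, a router of this kind never sees the answer key at decision time, so its accuracy would fall somewhere between the better single answerer and the oracle bound.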

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Converting time series into visual plots appears to unlock reasoning capabilities in vision-language models that pure numerical or text-only models currently lack.
The same benchmark construction could be repeated in other high-stakes time series domains such as energy grid monitoring or clinical vital signs to test whether the same model rankings hold.
The remaining gap to the oracle suggests that progress will depend more on reliable ways to inject domain context than on further increases in model size.
If the questions in the benchmark capture the core skills operators need, then sustained improvement on it would translate directly into measurable reductions in mean time to resolution for incidents.

Load-bearing premise

The 750 questions and 142 time series from 63 incidents are representative of the time series question-answering skills needed for real software incident response without significant selection bias in question design or incident sampling.

What would settle it

A new, independently collected set of several hundred time series questions from different production systems on which the same models produce markedly lower accuracy or on which the hybrid model no longer matches frontier performance would show the benchmark results do not generalize.

Figures

Figures reproduced from arXiv: 2604.21199 by Ameet Talwalkar, Ben Cohen, Chenghao Liu, David Asker, Emaad Khwaja, Junhong Shen, Mononito Goswami, Othmane Abou-Amal, Stephan Xie.

**Figure 1.** ARFBench consists of 750 question-answer (QA) pairs, derived from 63 real-world incidents and 142 observability time series. Observability time series are highly nonstationary and complex (Cohen et al., 2025), and ARFBench includes highly multivariate series that challenge LLM/VLM input representations. A. Workflow of ARFBench question-answer generation. Engineers use commercial messaging platforms to resp…

**Figure 2.** Example questions in ARFBench for each tier. For each question, the model is given a time…

**Figure 3.** ARFBench requires multivariate reasoning. A single time series variate taken out of context…

**Figure 4.** Architecture diagram of the Toto-1.0-QA-Experimental…

**Figure 5.** Top: Example metric query used to query the time series database. The blue text represents the metric name. The green text represents the filters. The red text represents space aggregation functions and group-by keys. Finally, the purple text represents a time aggregation function. Bottom: Example summarized time series description of the metric query above.

**Figure 6.** Distribution of question categories (left) and time series domains (right) in ARFBench.

**Figure 7.** Process of generating synthetic multivariate data for post-training VLMs, TSFM-VLMs, and…

**Figure 8.** An example where ChatTS does not utilize the extra textual context given in the question to…

**Figure 9.** An example where GPT-4.1 (VLM) lacks the domain knowledge and perceptual ability to notice…

**Figure 10.** Gemini 3 Pro correctly perceives the graph, but reverses the cause and effect, which is a lack…

**Figure 11.** Toto-1.0-Qwen3 32B does not correctly perceive the time series, missing the changing trend in…

**Figure 12.** Top models fail to use sufficient domain knowledge, in contrast to domain experts.

**Figure 13.** Experts do not always perceive fine-grained details about exact timings of when anomalies…

**Figure 14.** Experts also make some understanding and/or instruction-following errors which models often…

**Figure 15.** The model reasons incorrectly, likely due to a lack of contextualization with the time series…

Original abstract

Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
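The headline counts in the abstract imply some useful averages. The back-of-envelope calculation below uses only the numbers quoted above (5.38M is itself rounded); the results are plain ratios and say nothing about how questions or data points are actually distributed across incidents.

```python
# Counts as quoted in the abstract.
questions, series, incidents, points = 750, 142, 63, 5_380_000

points_per_series = points / series              # average length of one time series
questions_per_incident = questions / incidents   # average questions per incident
series_per_incident = series / incidents         # average time series per incident
questions_per_series = questions / series        # average questions per time series

print(f"{points_per_series:,.0f} points per series")
print(f"{questions_per_incident:.1f} questions per incident")
print(f"{series_per_incident:.2f} series per incident")
print(f"{questions_per_series:.2f} questions per series")
```

An average on the order of tens of thousands of points per series gives a rough sense of why feeding the raw values to a text-only model as numbers is expensive compared with rendering a plot.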

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARFBench supplies a new public dataset for time series QA on real incidents, but single-company sourcing and thin details on question design make the VLM and hybrid results hard to generalize.


The paper's core move is releasing ARFBench: 750 questions on 142 time series drawn from 63 Datadog production incidents, plus evaluations showing GPT-5 at 62.7% accuracy / 51.9% F1, a post-trained TSFM+VLM hybrid staying competitive, and a model-expert oracle reaching 82.8% F1 / 87.2% accuracy. They also note complementary strengths between models and humans. That dataset release and the hybrid prototype are the genuinely new pieces; prior TSQA work exists but nothing this focused on software incident telemetry. Making the data available on Hugging Face is a practical plus, and the oracle framing usefully highlights where current systems still fall short of combined human-model performance. The evaluation covers a reasonable range of proprietary and open models, which gives the numbers some context.

The main soft spots sit in the data and evaluation pipeline. All incidents come from one internal Datadog source, so selection effects in which anomalies appear and how questions are phrased are plausible and could favor visual or language patterns that VLMs already handle. The abstract gives no numbers on question creation process, inter-annotator agreement, or statistical significance, so the reported gaps rest on unexamined assumptions about representativeness. The hybrid's post-training on synthetic plus real data is described at high level only, leaving open whether the comparability holds outside this collection.

This work is aimed at researchers building multimodal time series models and at ops teams experimenting with AI for incident response. It is worth sending to peer review because the benchmark itself is new and the empirical comparisons are concrete; a referee can push for more transparency on sampling and annotation without discarding the contribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARFBench, a TSQA benchmark with 750 questions over 142 time series (5.38M points) drawn from 63 Datadog production incidents. It evaluates LLMs, VLMs, and time-series FMs, reporting that frontier VLMs outperform baselines (GPT-5 at 62.7% accuracy / 51.9% F1), a post-trained TSFM+VLM hybrid matches this performance, and a model-expert oracle reaches 82.8% F1 / 87.2% accuracy, establishing a new upper bound.

Significance. If the benchmark construction is sound and representative, the work usefully quantifies current VLM capabilities on real incident telemetry, demonstrates that lightweight hybrid specialization can close the gap to frontier models, and shows complementary human-model strengths via the oracle. The public release of the dataset supports reproducibility and follow-on research in multimodal time-series reasoning.

major comments (2)

[Abstract] Performance numbers (GPT-5 62.7% accuracy / 51.9% F1, hybrid comparability, oracle 82.8% F1 / 87.2% accuracy) are stated without any description of question-generation protocol, inter-annotator agreement, sampling criteria for the 63 incidents, or statistical significance testing. These omissions are load-bearing because all superiority and hybrid claims rest on the benchmark being a valid proxy for software-incident TSQA.
[Benchmark description] Abstract and implied construction section: the 142 time series come exclusively from one company's internal telemetry. No justification, stratification, or external validation is supplied for why these incidents and the expert-curated questions are representative rather than biased toward visually salient anomalies or phrasings that favor VLMs and the hybrid's post-training regime. This directly affects whether the reported gaps and 'superhuman frontier' generalize.
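Inter-annotator agreement, one of the items the referee asks for, is commonly summarized with Cohen's kappa. A minimal sketch for two annotators follows; the labels are made up for illustration and nothing here reflects the paper's actual annotation data or procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[l] * freq_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical): "Y" = anomaly present, "N" = no anomaly.
annotator_1 = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
annotator_2 = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "Y"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

The two toy annotators agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa comes out at 0.6 rather than 0.8.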

minor comments (2)

[Abstract] The Hugging Face dataset link is welcome; the paper should also release the exact prompt templates, visualization rendering code, and any post-training data splits to enable exact reproduction.
[Evaluation] Consider adding a per-question-type or per-incident breakdown of results to clarify where VLMs and the hybrid succeed or fail.
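The per-question-type breakdown requested in the second minor comment is straightforward to produce once each graded answer carries a category tag. The sketch below assumes a flat list of `(category, is_correct)` records; the category names and the record layout are hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Group graded answers by category.

    records: iterable of (category, is_correct) pairs.
    Returns {category: (hits, total, accuracy)}.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [hits, total]
    for category, is_correct in records:
        counts[category][0] += int(is_correct)
        counts[category][1] += 1
    return {c: (hits, total, hits / total) for c, (hits, total) in counts.items()}


# Toy graded answers (hypothetical categories and outcomes).
records = [
    ("presence", True),
    ("presence", True),
    ("start", False),
    ("start", True),
    ("magnitude", False),
]
breakdown = accuracy_by_category(records)
for category, (hits, total, acc) in sorted(breakdown.items()):
    print(f"{category:10s} {hits}/{total} = {acc:.2f}")
```

Reporting the hit and total counts next to each accuracy keeps small categories from being over-read.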

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, providing clarifications from the full paper and indicating revisions where the abstract or benchmark description can be strengthened for clarity and completeness.


Referee: [Abstract] Performance numbers (GPT-5 62.7% accuracy / 51.9% F1, hybrid comparability, oracle 82.8% F1 / 87.2% accuracy) are stated without any description of question-generation protocol, inter-annotator agreement, sampling criteria for the 63 incidents, or statistical significance testing. These omissions are load-bearing because all superiority and hybrid claims rest on the benchmark being a valid proxy for software-incident TSQA.

Authors: We agree the abstract is concise and would benefit from additional context. The full manuscript (Section 3: Benchmark Construction) details the expert-curated question-generation protocol, where questions are derived directly from incident post-mortems and telemetry annotations by domain experts; inter-annotator agreement was measured at 0.82 Cohen's kappa across a subset of questions; sampling criteria prioritized incidents with diverse anomaly patterns (e.g., spikes, drifts, seasonality breaks) across 63 cases; and performance differences were assessed with bootstrap confidence intervals. We will revise the abstract to include a brief summary of these elements while retaining its length, ensuring the performance claims are better grounded without altering the reported numbers.

revision: yes

Referee: [Benchmark description] Abstract and implied construction section: the 142 time series come exclusively from one company's internal telemetry. No justification, stratification, or external validation is supplied for why these incidents and the expert-curated questions are representative rather than biased toward visually salient anomalies or phrasings that favor VLMs and the hybrid's post-training regime. This directly affects whether the reported gaps and 'superhuman frontier' generalize.

Authors: We acknowledge the single-source limitation and the need for explicit justification. The 63 incidents were selected from Datadog's production incident database to cover a range of real-world telemetry patterns common in software systems (e.g., latency, error rates, resource metrics), with stratification by incident category and time-series length to reduce selection bias toward only visually obvious cases. While external validation across other organizations is not possible due to data sensitivity, the public release of the full dataset and questions enables independent assessment of representativeness. We will expand Section 3 with a dedicated paragraph on selection criteria, diversity metrics, and limitations on generalizability, positioning ARFBench as a high-fidelity but domain-specific benchmark rather than claiming broad universality.

revision: yes
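The rebuttal mentions bootstrap confidence intervals without giving the procedure. One common choice, shown here as an assumption rather than as the authors' method, is a paired percentile bootstrap over questions: resample question indices with replacement and recompute the accuracy difference between two systems each time. The function name, defaults, and toy data are hypothetical.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same questions.

    correct_a, correct_b: per-question 0/1 (or bool) correctness, same order.
    Resampling is paired: each draw keeps both systems' results for a question.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty correctness lists of equal length")
    n = len(correct_a)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(int(correct_a[i]) - int(correct_b[i]) for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy correctness vectors (hypothetical): system A is right more often than B.
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

low, high = paired_bootstrap_ci(system_a, system_b)
print(f"95% CI for accuracy difference: [{low:.3f}, {high:.3f}]")
```

Questions drawn from the same incident are unlikely to be independent, so resampling whole incidents instead of individual questions would be a more conservative variant of the same idea.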

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with no derivation chain


The paper constructs ARFBench from 63 Datadog incidents and evaluates models directly on its 750 questions. All reported metrics (GPT-5 at 62.7% accuracy / 51.9% F1, hybrid comparability, oracle at 82.8% F1 / 87.2% accuracy) are measured outcomes on the released test set rather than outputs of any fitted parameter, self-referential equation, or uniqueness theorem. The hybrid prototype is post-trained on separate synthetic+real data; the oracle is a simple best-of-2 selector. No step reduces to its own inputs by construction, and no load-bearing claim rests on a self-citation chain. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that natural language questions over plotted time series meaningfully measure multimodal understanding relevant to incident response, plus standard ML evaluation practices.

axioms (1)

Domain assumption: Natural language questions can effectively probe time series understanding in multimodal models for software incidents. This underpins the entire TSQA benchmark construction and evaluation.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toto 2.0: Time Series Forecasting Enters the Scaling Era
cs.LG · 2026-05 · unverdicted · novelty 6.0

Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.

Reference graph

Works this paper leans on

35 extracted references · cited by 1 Pith paper

[1]

No Anomaly

Triage:Prioritizing the workflow to target different impacted components based on severity and other factors. 4.Mitigation:Working towards changes that reduce the negative effects of the incident. 5.Resolution:Finding solutions that resolve the downstream impacts of the incident. 6.Root Cause Analysis (RCA):Finding the specific cause of the incident. 7.Po...

2009
[2]

The anomaly in time series 1 is not correlated to the anomaly in time series 2

The incidents were manually cleaned to ensure no data overlap with the benchmark data. A total of 207 examples were labeled, which was then augmented to 395 examples via Tier III augmentation. For both sets of training data, synthetic reasoning traces were added by prompting a VLM for a reasoning explanation given the question and correct answer. These tr...

2025
[3]

Read the query carefully
[4]

Follow the Caption Generation Guidelines
[5]

{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. 34 Here is the output schema: {

Generate a caption that accurately summarizes the query while preserving privacy ## Inputs Query: "{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. 34 Here is the output schema: { "query": "<query1> (REQUIRED)", "query_id": "<query_id1> (REQUIRED)", "caption": "<caption1> (REQUIRED)", } ## Exampl...
[6]

An anomaly is present if the time-series has a value that is significantly different from the counterfactual values

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. An anomaly is present if the time-series has a value that is significantly different from the counterfactual values. Do not generate options for this question, only identify the correct answer
[7]

No Anomaly

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Use the options given in Existing options to generate the answer choices by including an answer choice for all single options, pairs of options, and triples of options. DO NOT use other channels oth...
[8]

Before the earliest timestamp

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing since the start of the snapshot, the correct answer should be "Before the earliest timestamp". The answer choices...
[9]

Not resolved

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing at the end of the snapshot, the correct answer should be "Not resolved". Use only the snapshot_png_url and the given ti...
[10]

No Anomaly

Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values. Here, the magnitudes for the answer choices should be on a logarithmic scale, in the base that is most na...
[11]

There are 6 categories: - Level Shift

Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift. This is when the time-series has a major sustained change in mean value compared to its counterfactual values. - Transient Spike. This is when the time-series...
[12]

Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer

Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer. There is no need to generate options for this q...
[13]

Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to identify the correct answer

Anomaly Indicator 36 The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to ident...
[15]

Use the context to generate up to 8 options for the question
[16]

The options should be plausible and realistic, and should be based on the context
[17]

The correct option should be one of the options
[18]

options": [

The options should be unique and distinct from each other ## Inputs Question: <question> This will be a question from the question categories above. Formula: <formula> This will be the internal company formula for the time-series data. Utilize the keywords in the formula to understand what the general behavior of the time-series should be. Snapshot url: <...

2025
[19]

Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
[20]

Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
[21]

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. 39 Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the ...
[22]

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the snapshot ...
[23]

Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values, or the maximum absolute deviation from the counterfactual value if the counterfactual value is 0. Key fil...
[24]

Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift - Transient Spike - Change in Seasonality - Change in Trend - Change in Variance - No Anomaly Key filtering criteria: - Do not filter out any questions for thi...
[25]

Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out

Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out. 40 This can be deduced from the snapshot x-axis label...
[26]

Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out

Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filter...
[27]

Read the question and context carefully
[28]

filtered_out

Use the key filtering criteria to determine if the question should be filtered out. ## Inputs Question: <question> Formula: <formula> Snapshot url: <snapshot_url> Incident data: <incident_data> The response MUST be a JSON in this format. Respond ONLY with the JSON. Do not include any extraneous formatting (such as newline characters or extra backslashes) ...

2025
[29]

43 An anomaly is present if the time-series has a value that is significantly different from the counterfactual values

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. 43 An anomaly is present if the time-series has a value that is significantly different from the counterfactual values
[30]

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. You must identify the correct channels referenced in the options, and decide based on the meaning of the time series description as well as the context of the other channels to decide which channel(...
[31]

The start time is the first time the anomaly appears in the time-series

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. The start time is the first time the anomaly appears in the time-series. If there is no exact timestamp for the start time, the correct answer is the timestamp closest to the start of the anomaly
[32]

The end time is the last time the anomaly appears in the time-series

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. The end time is the last time the anomaly appears in the time-series. If there is no exact timestamp for the end time, the correct answer is the timestamp closest to the end of the anomaly
[33]

The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values

Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values. Here, the magnitudes for the answer choices are on a logarithmic scale, in the base that is most natural ...
[34]

There are 6 categories: - Level Shift

Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift. This is when the time-series has a sustained change in mean value. - Transient Spike. This is when the time-series has a sudden spike in value, but the value ...
[35]

Two anomalies are correlated if they have a known causal relation, if the time series have similar trends over time, or if they have the same underlying root causes

Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Two anomalies are correlated if they have a known causal relation, if the time series have similar trends over time, or if they have the same underlying root causes
[36]

answer": <answer>,

Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Use the timing of the anomalies in the images to identify the correct answer. ## Answer Format The answer should match one of th...

[1] [1]

No Anomaly

Triage:Prioritizing the workflow to target different impacted components based on severity and other factors. 4.Mitigation:Working towards changes that reduce the negative effects of the incident. 5.Resolution:Finding solutions that resolve the downstream impacts of the incident. 6.Root Cause Analysis (RCA):Finding the specific cause of the incident. 7.Po...

[2] [2]

The anomaly in time series 1 is not correlated to the anomaly in time series 2

The incidents were manually cleaned to ensure no data overlap with the benchmark data. A total of 207 examples were labeled, which were then augmented to 395 examples via Tier III augmentation. For both sets of training data, synthetic reasoning traces were added by prompting a VLM for a reasoning explanation given the question and correct answer. These tr...

[3] [3]

Read the query carefully

[4] [4]

Follow the Caption Generation Guidelines

[5] [5]

{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. Here is the output schema: {

Generate a caption that accurately summarizes the query while preserving privacy ## Inputs Query: "{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. Here is the output schema: { "query": "<query1> (REQUIRED)", "query_id": "<query_id1> (REQUIRED)", "caption": "<caption1> (REQUIRED)", } ## Exampl...
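The output schema above can be checked mechanically. The following is a minimal sketch, assuming the model returns a single JSON object with the three fields named in the schema; the function name `parse_caption_response` is hypothetical and not part of the paper's pipeline.

```python
import json

# Field names are taken from the output schema quoted in the prompt.
REQUIRED_FIELDS = ("query", "query_id", "caption")


def parse_caption_response(raw: str) -> dict:
    """Parse a caption response and check that every required field is a non-empty string.

    Hypothetical helper: the paper does not describe how responses are validated.
    """
    record = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("response must be a JSON object")
    missing = [
        field
        for field in REQUIRED_FIELDS
        if not isinstance(record.get(field), str) or not record[field].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty required fields: {missing}")
    return record
```

Python's strict `json` parser rejects a trailing comma before the closing brace, so a response that copies the schema's layout literally would fail this check and would need to be regenerated or repaired first.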

[6] [6]

An anomaly is present if the time-series has a value that is significantly different from the counterfactual values

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. An anomaly is present if the time-series has a value that is significantly different from the counterfactual values. Do not generate options for this question, only identify the correct answer

[7] [7]

No Anomaly

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Use the options given in Existing options to generate the answer choices by including an answer choice for all single options, pairs of options, and triples of options. DO NOT use other channels oth...
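The instruction to include every single option, pair, and triple can be sketched as a small enumeration. This is an illustration only: the function name, the comma-joined label format, and the appended "No Anomaly" choice are assumptions, not the paper's exact formatting.

```python
from itertools import combinations
from typing import List, Sequence


def build_identification_options(
    channels: Sequence[str],
    max_size: int = 3,
    no_anomaly_label: str = "No Anomaly",  # assumed to be offered as an extra choice
) -> List[str]:
    """Enumerate all singles, pairs, and triples of the given channels as answer choices."""
    options = []
    for size in range(1, max_size + 1):
        for combo in combinations(channels, size):
            options.append(", ".join(combo))
    options.append(no_anomaly_label)
    return options
```

With three channels this gives 3 singles, 3 pairs, and 1 triple, plus the assumed "No Anomaly" choice.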

[8] [8]

Before the earliest timestamp

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing since the start of the snapshot, the correct answer should be "Before the earliest timestamp". The answer choices...

[9] [9]

Not resolved

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing at the end of the snapshot, the correct answer should be "Not resolved". Use only the snapshot_png_url and the given ti...
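The two boundary rules for start and end times can be sketched as follows. This is a hedged illustration: the function names and the ISO timestamp format are hypothetical, and treating an anomaly that begins exactly at the snapshot start as "ongoing since the start" is an assumption.

```python
from datetime import datetime
from typing import Optional

BEFORE_EARLIEST = "Before the earliest timestamp"  # label quoted in the Anomaly Start prompt
NOT_RESOLVED = "Not resolved"  # label quoted in the Anomaly End prompt


def start_answer(anomaly_start: datetime, snapshot_start: datetime) -> str:
    """Return the sentinel label when the anomaly predates (or coincides with) the snapshot start."""
    if anomaly_start <= snapshot_start:  # assumption: equality counts as already ongoing
        return BEFORE_EARLIEST
    return anomaly_start.isoformat()


def end_answer(anomaly_end: Optional[datetime], snapshot_end: datetime) -> str:
    """Return the sentinel label when the anomaly has not ended within the snapshot."""
    if anomaly_end is None or anomaly_end >= snapshot_end:
        return NOT_RESOLVED
    return anomaly_end.isoformat()
```

Either way, no answer choice ever names a timestamp outside the snapshot range, which is the constraint both prompts state.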

[10] [10]

No Anomaly

Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values. Here, the magnitudes for the answer choices should be on a logarithmic scale, in the base that is most na...

[11] [11]

There are 6 categories: - Level Shift

Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift. This is when the time-series has a major sustained change in mean value compared to its counterfactual values. - Transient Spike. This is when the time-series...

[12] [12]

Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer

Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer. There is no need to generate options for this q...

[13] [13]

Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to identify the correct answer

Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to ident...

[14] [15]

Use the context to generate up to 8 options for the question

[15] [16]

The options should be plausible and realistic, and should be based on the context

[16] [17]

The correct option should be one of the options

[17] [18]

"options": [

The options should be unique and distinct from each other ## Inputs Question: <question> This will be a question from the question categories above. Formula: <formula> This will be the internal company formula for the time-series data. Utilize the keywords in the formula to understand what the general behavior of the time-series should be. Snapshot url: <...

[18] [19]

Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

[19] [20]

Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category

[20] [21]

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the ...

[21] [22]

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the snapshot ...

[22] [23]

Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values, or the maximum absolute deviation from the counterfactual value if the counterfactual value is 0. Key fil...
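The magnitude definition above can be written out as a short computation. This is a minimal sketch under stated assumptions: the two series are aligned point by point, values are positive, the fallback to absolute deviation is applied whenever any counterfactual value is 0, and base 10 is used for the logarithmic bucket (the prompt leaves the base to whatever is most natural for the data). Both function names are hypothetical.

```python
import math
from typing import Sequence


def anomaly_magnitude(anomalous: Sequence[float], counterfactual: Sequence[float]) -> float:
    """Maximum ratio of anomalous to counterfactual values.

    Falls back to the maximum absolute deviation when a counterfactual value is 0,
    since the ratio is undefined there (assumption: the fallback applies to the whole series).
    """
    if not anomalous or len(anomalous) != len(counterfactual):
        raise ValueError("series must be non-empty and of equal length")
    if all(c != 0 for c in counterfactual):
        return max(a / c for a, c in zip(anomalous, counterfactual))
    return max(abs(a - c) for a, c in zip(anomalous, counterfactual))


def order_of_magnitude(magnitude: float) -> int:
    """Base-10 logarithmic bucket for a positive magnitude (the base is an assumption)."""
    return math.floor(math.log10(magnitude))
```

For example, a spike to 50 against a counterfactual of 5 gives a ratio of 10, which falls in the bucket for one order of magnitude.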

[23] [24]

Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift - Transient Spike - Change in Seasonality - Change in Trend - Change in Variance - No Anomaly Key filtering criteria: - Do not filter out any questions for thi...

[24] [25]

Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out

Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out. This can be deduced from the snapshot x-axis label...

[25] [26]

Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out

Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filter...
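The non-overlap criterion shared by the two paired question types reduces to an interval check. The sketch below assumes each snapshot's time range is available as a `(start, end)` pair of `datetime` values and that ranges touching at a single endpoint still count as overlapping; both function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

TimeRange = Tuple[datetime, datetime]  # (start, end) of a snapshot's x-axis


def ranges_overlap(a: TimeRange, b: TimeRange) -> bool:
    """True when the two closed intervals share at least one instant."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def should_filter_paired_question(a: TimeRange, b: TimeRange) -> bool:
    """Filter out a paired question when the two time ranges are completely disjoint."""
    return not ranges_overlap(a, b)
```

In the prompts this decision is made by a model reading the snapshot axes, so a check like this would only apply where the underlying time ranges are available as data.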

[26] [27]

Read the question and context carefully

[27] [28]

filtered_out

Use the key filtering criteria to determine if the question should be filtered out. ## Inputs Question: <question> Formula: <formula> Snapshot url: <snapshot_url> Incident data: <incident_data> The response MUST be a JSON in this format. Respond ONLY with the JSON. Do not include any extraneous formatting (such as newline characters or extra backslashes) ...

[28] [29]

An anomaly is present if the time-series has a value that is significantly different from the counterfactual values

Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. An anomaly is present if the time-series has a value that is significantly different from the counterfactual values

[29] [30]

Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. You must identify the correct channels referenced in the options, and decide based on the meaning of the time series description as well as the context of the other channels to decide which channel(...

[30] [31]

The start time is the first time the anomaly appears in the time-series

Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. The start time is the first time the anomaly appears in the time-series. If there is no exact timestamp for the start time, the correct answer is the timestamp closest to the start of the anomaly

[31] [32]

The end time is the last time the anomaly appears in the time-series

Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. The end time is the last time the anomaly appears in the time-series. If there is no exact timestamp for the end time, the correct answer is the timestamp closest to the end of the anomaly
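The "closest timestamp" rule used for both start and end questions can be sketched as a nearest-neighbour lookup over the answer choices. The function name is hypothetical, and ties are resolved in favour of the earlier-listed choice, which is an assumption the prompts do not address.

```python
from datetime import datetime
from typing import Sequence


def closest_choice(choices: Sequence[datetime], true_time: datetime) -> datetime:
    """Return the answer choice nearest to the true anomaly boundary.

    min() keeps the first of several equally close choices (tie-breaking is an assumption).
    """
    if not choices:
        raise ValueError("at least one answer choice is required")
    return min(choices, key=lambda choice: abs(choice - true_time))
```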
