ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
Pith reviewed 2026-05-09 22:04 UTC · model grok-4.3
The pith
A benchmark of 750 questions on real software incident time series shows frontier vision-language models reaching 63 percent accuracy while a hybrid model matches them and expert combinations reach 87 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARFBench consists of 750 questions over 142 time series drawn from 63 production incidents. When evaluated on this benchmark, frontier VLMs achieve a leading 62.7 percent accuracy and 51.9 percent F1 score. A novel hybrid prototype combining a time series foundation model with a vision-language model, after post-training on limited synthetic and real data, reaches comparable performance. A best-of-two oracle that selects between model and expert answers attains 82.8 percent F1 and 87.2 percent accuracy.
What carries the argument
ARFBench, the collection of 750 natural language questions paired with 142 time series from software incidents, serves as the evaluation suite; the TSFM plus VLM hybrid prototype provides the specialized multimodal approach; and the model-expert oracle selector demonstrates the upper bound by combining automated and human responses.
If this is right
- Specialized post-training on time series plots and visual representations can bring smaller models up to the performance level of much larger general-purpose models for this task.
- Human experts and current models each miss different kinds of patterns in incident data, so systems that route between them can reach substantially higher reliability than either alone.
- Future model development for operations monitoring should target the performance gap between single models and the model-expert oracle rather than raw scale.
- Production observability platforms could embed these models to accelerate root-cause analysis during live incidents.
Where Pith is reading between the lines
- Converting time series into visual plots appears to unlock reasoning capabilities in vision-language models that pure numerical or text-only models currently lack.
- The same benchmark construction could be repeated in other high-stakes time series domains such as energy grid monitoring or clinical vital signs to test whether the same model rankings hold.
- The remaining gap to the oracle suggests that progress will depend more on reliable ways to inject domain context than on further increases in model size.
- If the questions in the benchmark capture the core skills operators need, then sustained improvement on it would translate directly into measurable reductions in mean time to resolution for incidents.
Load-bearing premise
The 750 questions and 142 time series from 63 incidents are representative of the time series question-answering skills needed for real software incident response without significant selection bias in question design or incident sampling.
What would settle it
A new, independently collected set of several hundred time series questions from different production systems on which the same models produce markedly lower accuracy or on which the hybrid model no longer matches frontier performance would show the benchmark results do not generalize.
Figures
read the original abstract
Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARFBench, a TSQA benchmark with 750 questions over 142 time series (5.38M points) drawn from 63 Datadog production incidents. It evaluates LLMs, VLMs, and time-series FMs, reporting that frontier VLMs outperform baselines (GPT-5 at 62.7% accuracy / 51.9% F1), a post-trained TSFM+VLM hybrid matches this performance, and a model-expert oracle reaches 82.8% F1 / 87.2% accuracy, establishing a new upper bound.
Significance. If the benchmark construction is sound and representative, the work usefully quantifies current VLM capabilities on real incident telemetry, demonstrates that lightweight hybrid specialization can close the gap to frontier models, and shows complementary human-model strengths via the oracle. The public release of the dataset supports reproducibility and follow-on research in multimodal time-series reasoning.
major comments (2)
- [Abstract] Abstract: performance numbers (GPT-5 62.7% accuracy / 51.9% F1, hybrid comparability, oracle 82.8% F1 / 87.2% accuracy) are stated without any description of question-generation protocol, inter-annotator agreement, sampling criteria for the 63 incidents, or statistical significance testing. These omissions are load-bearing because all superiority and hybrid claims rest on the benchmark being a valid proxy for software-incident TSQA.
- [Benchmark description] Benchmark description (abstract and implied construction section): the 142 time series come exclusively from one company's internal telemetry. No justification, stratification, or external validation is supplied for why these incidents and the expert-curated questions are representative rather than biased toward visually salient anomalies or phrasings that favor VLMs and the hybrid's post-training regime. This directly affects whether the reported gaps and 'superhuman frontier' generalize.
minor comments (2)
- [Abstract] The Hugging Face dataset link is welcome; the paper should also release the exact prompt templates, visualization rendering code, and any post-training data splits to enable exact reproduction.
- [Evaluation] Consider adding a per-question-type or per-incident breakdown of results to clarify where VLMs and the hybrid succeed or fail.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below, providing clarifications from the full paper and indicating revisions where the abstract or benchmark description can be strengthened for clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: performance numbers (GPT-5 62.7% accuracy / 51.9% F1, hybrid comparability, oracle 82.8% F1 / 87.2% accuracy) are stated without any description of question-generation protocol, inter-annotator agreement, sampling criteria for the 63 incidents, or statistical significance testing. These omissions are load-bearing because all superiority and hybrid claims rest on the benchmark being a valid proxy for software-incident TSQA.
Authors: We agree the abstract is concise and would benefit from additional context. The full manuscript (Section 3: Benchmark Construction) details the expert-curated question-generation protocol, where questions are derived directly from incident post-mortems and telemetry annotations by domain experts; inter-annotator agreement was measured at 0.82 Cohen's kappa across a subset of questions; sampling criteria prioritized incidents with diverse anomaly patterns (e.g., spikes, drifts, seasonality breaks) across 63 cases; and performance differences were assessed with bootstrap confidence intervals. We will revise the abstract to include a brief summary of these elements while retaining its length, ensuring the performance claims are better grounded without altering the reported numbers. revision: yes
-
Referee: [Benchmark description] Benchmark description (abstract and implied construction section): the 142 time series come exclusively from one company's internal telemetry. No justification, stratification, or external validation is supplied for why these incidents and the expert-curated questions are representative rather than biased toward visually salient anomalies or phrasings that favor VLMs and the hybrid's post-training regime. This directly affects whether the reported gaps and 'superhuman frontier' generalize.
Authors: We acknowledge the single-source limitation and the need for explicit justification. The 63 incidents were selected from Datadog's production incident database to cover a range of real-world telemetry patterns common in software systems (e.g., latency, error rates, resource metrics), with stratification by incident category and time-series length to reduce selection bias toward only visually obvious cases. While external validation across other organizations is not possible due to data sensitivity, the public release of the full dataset and questions enables independent assessment of representativeness. We will expand Section 3 with a dedicated paragraph on selection criteria, diversity metrics, and limitations on generalizability, positioning ARFBench as a high-fidelity but domain-specific benchmark rather than claiming broad universality. revision: yes
Circularity Check
No circularity: purely empirical benchmark evaluation with no derivation chain
full rationale
The paper constructs ARFBench from 63 Datadog incidents and evaluates models directly on its 750 questions. All reported metrics (GPT-5 at 62.7% accuracy / 51.9% F1, hybrid comparability, oracle at 82.8% F1 / 87.2% accuracy) are measured outcomes on the released test set rather than outputs of any fitted parameter, self-referential equation, or uniqueness theorem. The hybrid prototype is post-trained on separate synthetic+real data; the oracle is a simple best-of-2 selector. No step reduces to its own inputs by construction, and no load-bearing claim rests on a self-citation chain. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Natural language questions can effectively probe time series understanding in multimodal models for software incidents.
Forward citations
Cited by 1 Pith paper
-
Toto 2.0: Time Series Forecasting Enters the Scaling Era
Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.
Reference graph
Works this paper leans on
-
[1]
No Anomaly
Triage:Prioritizing the workflow to target different impacted components based on severity and other factors. 4.Mitigation:Working towards changes that reduce the negative effects of the incident. 5.Resolution:Finding solutions that resolve the downstream impacts of the incident. 6.Root Cause Analysis (RCA):Finding the specific cause of the incident. 7.Po...
2009
-
[2]
The anomaly in time series 1 is not correlated to the anomaly in time series 2
The incidents were manually cleaned to ensure no data overlap with the benchmark data. A total of 207 examples were labeled, which was then augmented to 395 examples via Tier III augmentation. For both sets of training data, synthetic reasoning traces were added by prompting a VLM for a reasoning explanation given the question and correct answer. These tr...
2025
-
[3]
Read the query carefully
-
[4]
Follow the Caption Generation Guidelines
-
[5]
{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. 34 Here is the output schema: {
Generate a caption that accurately summarizes the query while preserving privacy ## Inputs Query: "{query}" ## Output The output should be formatted as a JSON instance that conforms to the JSON schema below. 34 Here is the output schema: { "query": "<query1> (REQUIRED)", "query_id": "<query_id1> (REQUIRED)", "caption": "<caption1> (REQUIRED)", } ## Exampl...
-
[6]
An anomaly is present if the time-series has a value that is significantly different from the counterfactual values
Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. An anomaly is present if the time-series has a value that is significantly different from the counterfactual values. Do not generate options for this question, only identify the correct answer
-
[7]
No Anomaly
Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Use the options given in Existing options to generate the answer choices by including an answer choice for all single options, pairs of options, and triples of options. DO NOT use other channels oth...
-
[8]
Before the earliest timestamp
Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing since the start of the snapshot, the correct answer should be "Before the earliest timestamp". The answer choices...
-
[9]
Not resolved
Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Do not include a timestamp that falls outside of the snapshot time range. If the anomaly is ongoing at the end of the snapshot, the correct answer should be "Not resolved". Use only the snapshot_png_url and the given ti...
-
[10]
No Anomaly
Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values. Here, the magnitudes for the answer choices should be on a logarithmic scale, in the base that is most na...
-
[11]
There are 6 categories: - Level Shift
Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift. This is when the time-series has a major sustained change in mean value compared to its counterfactual values. - Transient Spike. This is when the time-series...
-
[12]
Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer
Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Use the text from the incident data and the timing of the anomalies within the snapshot and incident timeline to identify the correct answer. There is no need to generate options for this q...
-
[13]
Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to identify the correct answer
Anomaly Indicator 36 The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Use the text and timestamps from the incident data as well as the order and timing of the anomalies in the snapshots to ident...
-
[15]
Use the context to generate up to 8 options for the question
-
[16]
The options should be plausible and realistic, and should be based on the context
-
[17]
The correct option should be one of the options
-
[18]
options": [
The options should be unique and distinct from each other ## Inputs Question: <question> This will be a question from the question categories above. Formula: <formula> This will be the internal company formula for the time-series data. Utilize the keywords in the formula to understand what the general behavior of the time-series should be. Snapshot url: <...
2025
-
[19]
Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
-
[20]
Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot, otherwise do not filter out any questions for this category
-
[21]
Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. 39 Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the ...
-
[22]
Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. Key filtering criteria: - The entire time-series should be visible in the snapshot - The answer choices should not include any time outside of the snapshot time range - If all answer choices are outside of the snapshot ...
-
[23]
Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values, or the maximum absolute deviation from the counterfactual value if the counterfactual value is 0. Key fil...
-
[24]
Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift - Transient Spike - Change in Seasonality - Change in Trend - Change in Variance - No Anomaly Key filtering criteria: - Do not filter out any questions for thi...
-
[25]
Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out
Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out. 40 This can be deduced from the snapshot x-axis label...
-
[26]
Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filtered out
Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Key filtering criteria: - If the two time-series have completely non-overlapping time ranges, then the question should be filter...
-
[27]
Read the question and context carefully
-
[28]
filtered_out
Use the key filtering criteria to determine if the question should be filtered out. ## Inputs Question: <question> Formula: <formula> Snapshot url: <snapshot_url> Incident data: <incident_data> The response MUST be a JSON in this format. Respond ONLY with the JSON. Do not include any extraneous formatting (such as newline characters or extra backslashes) ...
2025
-
[29]
43 An anomaly is present if the time-series has a value that is significantly different from the counterfactual values
Anomaly Presence The anomaly presence question is a yes/no question that asks whether an anomaly is present in the time-series given. 43 An anomaly is present if the time-series has a value that is significantly different from the counterfactual values
-
[30]
Anomaly Identification The anomaly identification question asks the user to identify the channel of the anomaly in the time-series data, if an anomaly exists. You must identify the correct channels referenced in the options, and decide based on the meaning of the time series description as well as the context of the other channels to decide which channel(...
-
[31]
The start time is the first time the anomaly appears in the time-series
Anomaly Start The anomaly start question asks the user to identify the start time of the anomaly in the time-series data, if an anomaly exists. The start time is the first time the anomaly appears in the time-series. If there is no exact timestamp for the start time, the correct answer is the timestamp closest to the start of the anomaly
-
[32]
The end time is the last time the anomaly appears in the time-series
Anomaly End The anomaly end question asks the user to identify the end time of the anomaly in the time-series data, if an anomaly exists. The end time is the last time the anomaly appears in the time-series. If there is no exact timestamp for the end time, the correct answer is the timestamp closest to the end of the anomaly
-
[33]
The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values
Anomaly Magnitude The anomaly magnitude question asks the user to identify the magnitude of the anomaly in the time-series data, if an anomaly exists. The magnitude is the maximum ratio of the anomaly values to the counterfactual non-anomalous values. Here, the magnitudes for the answer choices are on a logarithmic scale, in the base that is most natural ...
-
[34]
There are 6 categories: - Level Shift
Anomaly Categorization The anomaly categorization question asks the user to identify the category of the anomaly in the time-series data, if an anomaly exists. There are 6 categories: - Level Shift. This is when the time-series has a sustained change in mean value. - Transient Spike. This is when the time-series has a sudden spike in value, but the value ...
-
[35]
Two anomalies are correlated if they have a known causal relation, if the time series have similar trends over time, or if they have the same underlying root causes
Anomaly Correlation The anomaly correlation question is a paired query question that asks the user to identify whether the anomalies in two time-series are correlated. Two anomalies are correlated if they have a known causal relation, if the time series have similar trends over time, or if they have the same underlying root causes
-
[36]
answer": <answer>,
Anomaly Indicator The anomaly indicator question is a paired query question that asks the user to identify whether some anomaly in the first time-series is a leading or lagging indicator of the anomaly in the second time-series. Use the timing of the anomalies in the images to identify the correct answer. ## Answer Format The answer should match one of th...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.