ARFBench shows vision-language models lead time series question answering for software incidents at 62.7% accuracy, a hybrid TSFM+VLM matches them, and a model-expert oracle reaches 87.2% accuracy.
Key filtering criteria:
- If the two time series have completely non-overlapping time ranges, the question should be filtered out.
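The non-overlap criterion above can be sketched as a simple interval check. This is a minimal illustration, not the benchmark's actual filtering code: the `(timestamp, value)` representation, the helper names `time_range` and `should_filter_out`, and the choice to treat ranges that merely touch at one endpoint as overlapping are all assumptions made for the example.

```python
from datetime import datetime
from typing import Sequence, Tuple

# Hypothetical representation: a series is a sequence of (timestamp, value) pairs.
Point = Tuple[datetime, float]


def time_range(series: Sequence[Point]) -> Tuple[datetime, datetime]:
    """Return the (earliest, latest) timestamp of a non-empty series."""
    timestamps = [t for t, _ in series]
    return min(timestamps), max(timestamps)


def should_filter_out(series_a: Sequence[Point], series_b: Sequence[Point]) -> bool:
    """Return True when the two series cover completely disjoint time ranges."""
    start_a, end_a = time_range(series_a)
    start_b, end_b = time_range(series_b)
    # Assumption: ranges sharing a single endpoint still count as overlapping,
    # so only strictly separated ranges are filtered out.
    return end_a < start_b or end_b < start_a


# Illustrative timestamps only; not taken from the benchmark.
early = [(datetime(2024, 1, 1, 0, 0), 1.0), (datetime(2024, 1, 1, 1, 0), 2.0)]
late = [(datetime(2024, 1, 1, 2, 0), 3.0), (datetime(2024, 1, 1, 3, 0), 4.0)]
mixed = [(datetime(2024, 1, 1, 0, 30), 5.0), (datetime(2024, 1, 1, 2, 30), 6.0)]

print(should_filter_out(early, late))   # disjoint ranges
print(should_filter_out(early, mixed))  # partially overlapping ranges
```

Comparing only the range endpoints keeps the check independent of sampling rate, so two series with different resolutions are still compared fairly.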
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response