CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic
Pith reviewed 2026-05-08 02:22 UTC · model grok-4.3
The pith
CAN-QA turns raw vehicle network logs into natural-language questions to test how well models reason about traffic behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate 33,128 natural-language QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Evaluation of large language models shows that although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation.
What carries the argument
The rule-based template generator that segments CAN logs into time windows and automatically derives natural-language questions together with their ground-truth answers from the logs' semantic and temporal structure.
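The generator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Frame` record, the fixed non-overlapping window length, and the `ordering_question` template are all assumptions introduced here, and the paper's actual window definition and 10 template categories may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for one logged CAN frame."""
    timestamp: float  # seconds since the start of the capture
    can_id: int       # arbitration ID
    data: bytes       # payload bytes


def segment_windows(frames: List[Frame], window_s: float) -> List[List[Frame]]:
    """Group frames into consecutive, non-overlapping time windows.

    Windows that contain no frames are skipped (an assumption of this sketch).
    """
    if not frames:
        return []
    ordered = sorted(frames, key=lambda f: f.timestamp)
    start = ordered[0].timestamp
    buckets: Dict[int, List[Frame]] = {}
    for f in ordered:
        idx = int((f.timestamp - start) // window_s)
        buckets.setdefault(idx, []).append(f)
    return [buckets[i] for i in sorted(buckets)]


def ordering_question(window: List[Frame], id_a: int, id_b: int) -> Tuple[str, bool]:
    """Hypothetical temporal template: does id_a first appear before id_b?

    The answer is computed directly from the window, so it is deterministic.
    """
    first_seen: Dict[int, float] = {}
    for f in window:
        first_seen[f.can_id] = min(f.timestamp, first_seen.get(f.can_id, f.timestamp))
    question = (
        f"In this window, does ID 0x{id_a:03X} first appear "
        f"before ID 0x{id_b:03X}?"
    )
    answer = (
        id_a in first_seen
        and id_b in first_seen
        and first_seen[id_a] < first_seen[id_b]
    )
    return question, answer
```

Because the answer is derived from the same window the question refers to, correctness of the ground truth reduces to correctness of the rule. Whether the rule tests the intended kind of reasoning is a separate question.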
If this is right
- Forensic analysis of CAN traffic can shift from label prediction to explicit question-driven reasoning that matches how analysts actually work.
- Model weaknesses in temporal and multi-condition questions identify concrete skills that need improvement before LLMs can be trusted for vehicle security tasks.
- The automatically generated dataset supplies a standardized, reproducible testbed for comparing future models on CAN understanding.
- The same template-driven creation process can be reused to produce evaluation data at low cost for other log-based security domains.
Where Pith is reading between the lines
- The same log-to-question conversion method could be adapted to create QA benchmarks for other time-series protocols such as industrial control systems.
- Strong performance on CAN-QA might indicate an LLM's broader suitability for explaining anomalies in any sequential sensor stream.
- Extending the benchmark with human-written questions could reveal whether the current templates miss important forensic scenarios.
Load-bearing premise
The deterministic rule-based templates generate questions and ground-truth answers that accurately reflect the semantic, temporal, and relational properties required for real-world forensic workflows over CAN traffic.
What would settle it
A comparison in which practicing automotive security analysts manually answer a random sample of the generated questions and their answers are checked against the benchmark's automatic ground truth for consistency on temporal and multi-condition items.
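One way such a consistency check could be scored is sketched below, assuming each sampled question has one analyst answer and one automatic ground-truth answer. The function names are hypothetical, and Cohen's kappa is used here only as a common chance-corrected agreement statistic; the paper does not specify a metric.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(analyst: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Fraction of sampled questions where the analyst matches the ground truth."""
    if len(analyst) != len(truth) or not truth:
        raise ValueError("need two non-empty answer lists of equal length")
    return sum(a == t for a, t in zip(analyst, truth)) / len(truth)


def cohens_kappa(analyst: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between analyst answers and ground truth."""
    n = len(truth)
    p_observed = agreement_rate(analyst, truth)
    analyst_counts = Counter(analyst)
    truth_counts = Counter(truth)
    labels = analyst_counts.keys() | truth_counts.keys()
    p_expected = sum(analyst_counts[k] * truth_counts[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        # Both sides used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Running the two functions separately on the temporal and multi-condition subsets would show whether disagreement concentrates in the categories the headline claim depends on.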
Original abstract
The Controller Area Network (CAN) is a safety-critical in-vehicle communication protocol that lacks built-in security mechanisms, making intrusion detection essential. Existing approaches predominantly formulate CAN intrusion detection as a classification task, mapping complex traffic patterns to attack labels. However, this formulation abstracts away the temporal and relational structure of CAN traffic and misaligns with real-world forensic workflows, which require systematic reasoning about traffic behavior. To address this gap, we introduce CAN-QA, the first benchmark that reformulates CAN traffic analysis as a question-answering (QA) task. CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate natural-language questions paired with automatically derived ground-truth answers. The resulting dataset comprises 33,128 QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Using CAN-QA, we evaluate large language models across both True/False and multiple-choice formats. Our results indicate that, although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. Our code is available at https://github.com/Kriiiiss/CAN-QA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAN-QA, the first benchmark reformulating CAN traffic analysis as a QA task. Raw CAN logs are converted into temporally segmented windows, from which deterministic rule-based templates automatically generate 33,128 natural-language QA pairs across 10 categories targeting semantic, temporal, and relational properties of the traffic. LLM evaluations in true/false and multiple-choice formats show that models capture superficial statistical regularities but struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. The code is released publicly.
Significance. If the templates are shown to faithfully require the claimed forms of reasoning without artifacts, this would be a meaningful contribution by shifting CAN security analysis from classification to a QA formulation that better matches forensic workflows. The public code and scale of the dataset provide clear reproducibility value for the automotive security and LLM evaluation communities.
major comments (2)
- [§3 (CAN-QA Construction)] The deterministic rule-based templates are asserted to generate questions and ground-truth answers that specifically test temporal reasoning, multi-condition inference, and behavioral interpretation, yet no human validation, expert review, inter-annotator agreement, or accuracy metrics on the generated pairs are reported. This is load-bearing for the headline result, because without such checks the observed LLM performance gaps cannot be confidently attributed to reasoning deficits rather than possible template shortcuts, ambiguities, or incorrect ground truths.
- [§4 (Experiments and Results)] The attribution of specific failure modes (temporal reasoning, multi-condition inference, behavioral interpretation) requires per-category accuracy tables, error analysis, or ablation studies showing that each category cannot be solved by superficial statistics alone; the current high-level description does not supply this evidence.
minor comments (1)
- [Abstract] Adding one or two concrete performance figures (e.g., accuracy ranges or category-wise gaps) would make the claims more immediately verifiable.
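The concern in the §4 comment about superficial statistics can be made concrete with a majority-answer baseline: for each category, always predict the most frequent ground-truth answer. The sketch below is illustrative only; the category names in the usage note are hypothetical and no real benchmark figures are implied.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def majority_baseline(pairs: Iterable[Tuple[str, Hashable]]) -> Dict[str, float]:
    """Accuracy of always guessing the most common answer, per category.

    `pairs` holds (category, ground_truth_answer) for every QA item.
    """
    by_category: Dict[str, Counter] = defaultdict(Counter)
    for category, answer in pairs:
        by_category[category][answer] += 1
    return {
        category: max(counts.values()) / sum(counts.values())
        for category, counts in by_category.items()
    }
```

For example, a call such as `majority_baseline([("temporal_order", True), ("temporal_order", False)])` yields 0.5 for that category. If a model's accuracy in a category is close to this baseline, its performance there says little about reasoning in either direction.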
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important aspects of validation and analysis that will strengthen the manuscript. We appreciate the positive assessment of the benchmark's novelty, scale, and reproducibility value. We address each major comment below and outline specific revisions.
- Referee: [§3 (CAN-QA Construction)] The deterministic rule-based templates are asserted to generate questions and ground-truth answers that specifically test temporal reasoning, multi-condition inference, and behavioral interpretation, yet no human validation, expert review, inter-annotator agreement, or accuracy metrics on the generated pairs are reported. This is load-bearing for the headline result, because without such checks the observed LLM performance gaps cannot be confidently attributed to reasoning deficits rather than possible template shortcuts, ambiguities, or incorrect ground truths.
  Authors: We agree that explicit validation metrics would increase confidence in the benchmark. The templates are fully deterministic, with ground-truth answers computed directly from the segmented CAN windows using protocol-defined rules (e.g., exact temporal ordering checks and multi-signal conjunctions), eliminating ambiguity by construction. However, we acknowledge the absence of reported human review or inter-annotator agreement in the current version. In revision, we will add: (1) a detailed appendix with template pseudocode and example generations for each of the 10 categories, (2) results from a small-scale expert review (two automotive security researchers manually verifying 200 randomly sampled pairs for correctness and reasoning type), and (3) a quantitative check confirming 100% template-to-answer fidelity on the sampled set. This will directly address potential shortcuts or errors.
  Revision: partial
- Referee: [§4 (Experiments and Results)] The attribution of specific failure modes (temporal reasoning, multi-condition inference, behavioral interpretation) requires per-category accuracy tables, error analysis, or ablation studies showing that each category cannot be solved by superficial statistics alone; the current high-level description does not supply this evidence.
  Authors: We concur that granular evidence is needed to rigorously attribute the observed gaps. The current manuscript reports aggregate accuracies and provides qualitative failure examples. For the revision, we will expand §4 with: (1) per-category accuracy tables for both True/False and multiple-choice formats across all evaluated models, (2) a systematic error analysis breaking down mistakes by type (temporal, multi-condition, behavioral) with representative examples, and (3) ablation studies on a subset of categories, including performance on temporally shuffled windows and on questions with reduced conditions, to demonstrate that models cannot rely on superficial co-occurrence statistics alone for the harder categories. These additions will be supported by new figures and tables.
  Revision: yes
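The temporally shuffled window ablation and the per-category table mentioned in the rebuttal could look roughly like the following. This is a sketch under stated assumptions: a window is reduced to `(timestamp, CAN ID)` tuples, shuffling keeps the timestamps and the multiset of IDs while permuting which ID occupies which slot, and the record layout for model outputs is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical reduced frame: (timestamp in seconds, CAN ID); payload omitted.
Frame = Tuple[float, int]


def shuffle_order(window: List[Frame], seed: int) -> List[Frame]:
    """Destroy temporal order while preserving ID counts and timestamps."""
    rng = random.Random(seed)  # seeded so the ablation is reproducible
    ids = [can_id for _, can_id in window]
    rng.shuffle(ids)
    times = sorted(t for t, _ in window)
    return list(zip(times, ids))


def per_category_accuracy(
    records: Iterable[Tuple[str, Hashable, Hashable]]
) -> Dict[str, float]:
    """Accuracy per category from (category, predicted, gold) records."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        hits[category] += int(predicted == gold)
    return {category: hits[category] / totals[category] for category in totals}
```

If a model's accuracy on a temporal category, scored against answers recomputed for the shuffled window, stays close to its accuracy on the original windows, that suggests it is tracking order. If it keeps giving the original answers, it is likely relying on co-occurrence statistics that shuffling leaves intact.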
Circularity Check
No significant circularity: benchmark generation and evaluation remain independent
The paper derives the CAN-QA dataset by applying fixed deterministic rule-based templates to raw CAN logs, producing questions and ground-truth answers by construction from the input traffic windows. Model performance is then measured against this fixed dataset in a separate evaluation step. No equations, fitted parameters, or predictions reduce to the inputs by definition; no load-bearing self-citations appear; and the central claim (LLM limitations on temporal/multi-condition reasoning) is an empirical observation on the generated data rather than a tautology. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: deterministic rule-based templates can produce QA pairs that capture distinct semantic and temporal properties of CAN traffic.