arxiv: 2604.24935 · v1 · submitted 2026-04-27 · 💻 cs.CR · cs.LG

CAN-QA: A Question-Answering Benchmark for Reasoning over In-Vehicle CAN Traffic

Pith reviewed 2026-05-08 02:22 UTC · model grok-4.3

keywords CAN bus · question answering · intrusion detection · large language models · vehicle security · temporal reasoning · benchmark dataset · forensic analysis

The pith

CAN-QA turns raw vehicle network logs into natural-language questions to test how well models reason about traffic behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the first benchmark that reformulates analysis of Controller Area Network traffic as a question-answering task instead of simple classification. Raw CAN logs are split into time windows and then fed through fixed rule-based templates that automatically produce natural-language questions paired with ground-truth answers. The resulting set contains 33,128 pairs spread across ten categories that target different semantic and temporal features of the data. When large language models are tested on true/false and multiple-choice versions of these questions, they pick up basic statistical patterns yet fail on tasks that require tracking sequences across time, combining several conditions, or interpreting higher-level driving actions.

Core claim

CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate 33,128 natural-language QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Evaluation of large language models shows that although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation.

What carries the argument

The rule-based template generator that segments CAN logs into time windows and automatically derives natural-language questions together with their ground-truth answers from the logs' semantic and temporal structure.
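
The windowing and template step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's generator: the `Frame` type, the `segment`, `count_question`, and `order_question` helpers, the one-second window, the question wording, and the category names `frequency` and `temporal_order` are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp: float  # seconds since capture start
    can_id: int       # arbitration ID
    data: bytes       # payload bytes


def segment(frames, window_s):
    """Group frames into fixed-length time windows, ordered by time."""
    buckets = defaultdict(list)
    for f in sorted(frames, key=lambda f: f.timestamp):
        buckets[int(f.timestamp // window_s)].append(f)
    return [buckets[k] for k in sorted(buckets)]


def count_question(window, can_id, threshold):
    """Hypothetical frequency template; the answer is computed from the window."""
    n = sum(1 for f in window if f.can_id == can_id)
    return {
        "category": "frequency",
        "question": (f"Does CAN ID 0x{can_id:03X} appear more than "
                     f"{threshold} times in this window?"),
        "answer": n > threshold,
    }


def order_question(window, id_a, id_b):
    """Hypothetical temporal template; compares first occurrences of two IDs."""
    first = {}
    for f in window:
        first[f.can_id] = min(first.get(f.can_id, f.timestamp), f.timestamp)
    if id_a not in first or id_b not in first:
        return None  # template does not apply to this window
    return {
        "category": "temporal_order",
        "question": (f"Does the first frame with ID 0x{id_a:03X} occur before "
                     f"the first frame with ID 0x{id_b:03X}?"),
        "answer": first[id_a] < first[id_b],
    }


# Toy log: three frames in the first second, two in the next.
frames = [
    Frame(0.00, 0x100, b"\x00"),
    Frame(0.02, 0x200, b"\x01"),
    Frame(0.04, 0x100, b"\x02"),
    Frame(1.10, 0x200, b"\x03"),
    Frame(1.20, 0x100, b"\x04"),
]
windows = segment(frames, window_s=1.0)
qa_pairs = [
    count_question(windows[0], 0x100, threshold=1),
    order_question(windows[1], 0x100, 0x200),
]
```

Because each answer is a pure function of the window contents, regenerating the dataset from the same logs would reproduce the same pairs, which is the property the premise below depends on.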

If this is right

  • Forensic analysis of CAN traffic can shift from label prediction to explicit question-driven reasoning that matches how analysts actually work.
  • Model weaknesses in temporal and multi-condition questions identify concrete skills that need improvement before LLMs can be trusted for vehicle security tasks.
  • The automatically generated dataset supplies a standardized, reproducible testbed for comparing future models on CAN understanding.
  • The same template-driven creation process can be reused to produce evaluation data at low cost for other log-based security domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-to-question conversion method could be adapted to create QA benchmarks for other time-series protocols such as industrial control systems.
  • Strong performance on CAN-QA might indicate an LLM's broader suitability for explaining anomalies in any sequential sensor stream.
  • Extending the benchmark with human-written questions could reveal whether the current templates miss important forensic scenarios.

Load-bearing premise

The deterministic rule-based templates generate questions and ground-truth answers that accurately reflect the semantic, temporal, and relational properties required for real-world forensic workflows over CAN traffic.

What would settle it

A comparison in which practicing automotive security analysts manually answer a random sample of the generated questions and their answers are checked against the benchmark's automatic ground truth for consistency on temporal and multi-condition items.
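
One way such a comparison could be scored is a per-category agreement rate between analyst answers and the automatic ground truth. The sketch below assumes a hypothetical record layout of `(category, benchmark_answer, analyst_answer)` and invented category names; the sample values are placeholders, not results.

```python
from collections import defaultdict


def agreement_by_category(records):
    """Fraction of items per category where the analyst matches the benchmark."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, benchmark_answer, analyst_answer in records:
        totals[category] += 1
        hits[category] += int(benchmark_answer == analyst_answer)
    return {c: hits[c] / totals[c] for c in totals}


# Placeholder records for illustration only.
sample = [
    ("temporal", True, True),
    ("temporal", False, True),
    ("multi_condition", "B", "B"),
    ("multi_condition", "C", "C"),
]
rates = agreement_by_category(sample)
```

Low agreement concentrated in the temporal or multi-condition categories would point at template or ground-truth problems rather than model reasoning deficits.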

Figures

Figures reproduced from arXiv: 2604.24935 by Abhijay Deevi, Jing Chen, Onat Gungor, Tajana Rosing.

Figure 1: Advancing Beyond Binary Anomaly Detection to CAN-QA.
Figure 2: Overview of the CAN-QA framework.
Figure 3: Zero-shot accuracy on TF and MCQ across selected models.
Figure 4: Per-category accuracy across TF (top) and MCQ (bottom) questions.
Figure 5: Effect of prompting strategies on structured CAN-QA reasoning.

Abstract

The Controller Area Network (CAN) is a safety-critical in-vehicle communication protocol that lacks built-in security mechanisms, making intrusion detection essential. Existing approaches predominantly formulate CAN intrusion detection as a classification task, mapping complex traffic patterns to attack labels. However, this formulation abstracts away the temporal and relational structure of CAN traffic and misaligns with real-world forensic workflows, which require systematic reasoning about traffic behavior. To address this gap, we introduce CAN-QA, the first benchmark that reformulates CAN traffic analysis as a question-answering (QA) task. CAN-QA converts raw CAN logs into temporally segmented windows and applies deterministic rule-based templates to generate natural-language questions paired with automatically derived ground-truth answers. The resulting dataset comprises 33,128 QA pairs across 10 categories, each targeting distinct semantic and temporal properties of CAN traffic. Using CAN-QA, we evaluate large language models across both True/False and multiple-choice formats. Our results indicate that, although these models capture superficial statistical regularities, they struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. Our code is available at https://github.com/Kriiiiss/CAN-QA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CAN-QA, the first benchmark reformulating CAN traffic analysis as a QA task. Raw CAN logs are converted into temporally segmented windows, from which deterministic rule-based templates automatically generate 33,128 natural-language QA pairs across 10 categories targeting semantic, temporal, and relational properties of the traffic. LLM evaluations in true/false and multiple-choice formats show that models capture superficial statistical regularities but struggle with temporal reasoning, multi-condition inference, and higher-level behavioral interpretation. The code is released publicly.

Significance. If the templates are shown to faithfully require the claimed forms of reasoning without artifacts, this would be a meaningful contribution by shifting CAN security analysis from classification to a QA formulation that better matches forensic workflows. The public code and scale of the dataset provide clear reproducibility value for the automotive security and LLM evaluation communities.

major comments (2)
  1. §3 (CAN-QA Construction): The deterministic rule-based templates are asserted to generate questions and ground-truth answers that specifically test temporal reasoning, multi-condition inference, and behavioral interpretation, yet no human validation, expert review, inter-annotator agreement, or accuracy metrics on the generated pairs are reported. This is load-bearing for the headline result, because without such checks the observed LLM performance gaps cannot be confidently attributed to reasoning deficits rather than possible template shortcuts, ambiguities, or incorrect ground truths.
  2. §4 (Experiments and Results): The attribution of specific failure modes (temporal reasoning, multi-condition inference, behavioral interpretation) requires per-category accuracy tables, error analysis, or ablation studies showing that each category cannot be solved by superficial statistics alone; the current high-level description does not supply this evidence.
minor comments (1)
  1. Abstract: Adding one or two concrete performance figures (e.g., accuracy ranges or category-wise gaps) would make the claims more immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of validation and analysis that will strengthen the manuscript. We appreciate the positive assessment of the benchmark's novelty, scale, and reproducibility value. We address each major comment below and outline specific revisions.

point-by-point responses
  1. Referee: §3 (CAN-QA Construction): The deterministic rule-based templates are asserted to generate questions and ground-truth answers that specifically test temporal reasoning, multi-condition inference, and behavioral interpretation, yet no human validation, expert review, inter-annotator agreement, or accuracy metrics on the generated pairs are reported. This is load-bearing for the headline result, because without such checks the observed LLM performance gaps cannot be confidently attributed to reasoning deficits rather than possible template shortcuts, ambiguities, or incorrect ground truths.

    Authors: We agree that explicit validation metrics would increase confidence in the benchmark. The templates are fully deterministic, with ground-truth answers computed directly from the segmented CAN windows using protocol-defined rules (e.g., exact temporal ordering checks and multi-signal conjunctions), eliminating ambiguity by construction. However, we acknowledge the absence of reported human review or inter-annotator agreement in the current version. In revision, we will add: (1) a detailed appendix with template pseudocode and example generations for each of the 10 categories, (2) results from a small-scale expert review (two automotive security researchers manually verifying 200 randomly sampled pairs for correctness and reasoning type), and (3) a quantitative check confirming 100% template-to-answer fidelity on the sampled set. This will directly address potential shortcuts or errors.
    revision: partial

  2. Referee: §4 (Experiments and Results): The attribution of specific failure modes (temporal reasoning, multi-condition inference, behavioral interpretation) requires per-category accuracy tables, error analysis, or ablation studies showing that each category cannot be solved by superficial statistics alone; the current high-level description does not supply this evidence.

    Authors: We concur that granular evidence is needed to rigorously attribute the observed gaps. The current manuscript reports aggregate accuracies and provides qualitative failure examples. For the revision, we will expand §4 with: (1) per-category accuracy tables for both True/False and multiple-choice formats across all evaluated models, (2) a systematic error analysis breaking down mistakes by type (temporal, multi-condition, behavioral) with representative examples, and (3) ablation studies on a subset of categories, including performance on temporally shuffled windows and on questions with reduced conditions, to demonstrate that models cannot rely on superficial co-occurrence statistics alone for the harder categories. These additions will be supported by new figures and tables.
    revision: yes
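
The per-category tables and the shuffled-window ablation mentioned in the second response could be computed along the following lines. This is a sketch under stated assumptions: the example layout (`category`, `window`, `question`, `answer`), the tuple representation of frames as `(timestamp, can_id, data)`, and the `count_baseline` stand-in for a model are all hypothetical and chosen only to make the block runnable.

```python
import random
from collections import defaultdict


def shuffle_window(window, seed=0):
    """Keep timestamps in place but permute which (id, data) sits at each one."""
    rng = random.Random(seed)
    payloads = [(can_id, data) for _, can_id, data in window]
    rng.shuffle(payloads)
    return [(t, can_id, data)
            for (t, _, _), (can_id, data) in zip(window, payloads)]


def per_category_accuracy(examples, predict, transform=lambda w: w):
    """Accuracy per category; `transform` is applied to each window first."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = predict(transform(ex["window"]), ex["question"])
        total[ex["category"]] += 1
        correct[ex["category"]] += int(pred == ex["answer"])
    return {c: correct[c] / total[c] for c in total}


def count_baseline(window, question):
    """Toy order-blind predictor: is ID 0x100 seen more than once?"""
    return sum(1 for _, can_id, _ in window if can_id == 0x100) > 1


window = [(0.0, 0x100, b"\x00"), (0.1, 0x200, b"\x01"), (0.2, 0x100, b"\x02")]
examples = [{
    "category": "frequency",
    "window": window,
    "question": "Does CAN ID 0x100 appear more than 1 times in this window?",
    "answer": True,
}]

original = per_category_accuracy(examples, count_baseline)
shuffled = per_category_accuracy(examples, count_baseline, shuffle_window)
```

Ground truth stays tied to the original window, so a predictor whose accuracy on a category is unchanged after shuffling is not using temporal order for that category. A real run would replace `count_baseline` with a call to the evaluated model.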

Circularity Check

0 steps flagged

No significant circularity: benchmark generation and evaluation remain independent

The paper derives the CAN-QA dataset by applying fixed deterministic rule-based templates to raw CAN logs, producing questions and ground-truth answers by construction from the input traffic windows. Model performance is then measured against this fixed dataset in a separate evaluation step. No equations, fitted parameters, or predictions reduce to the inputs by definition; no load-bearing self-citations appear; and the central claim (LLM limitations on temporal/multi-condition reasoning) is an empirical observation on the generated data rather than a tautology. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on the validity of rule-based QA generation without introducing free parameters or new entities; the main unverified element is the fidelity of templates to forensic reality.

axioms (1)
  • Domain assumption: Deterministic rule-based templates can produce QA pairs that capture distinct semantic and temporal properties of CAN traffic.
    Invoked in the dataset creation process described in the abstract.

pith-pipeline@v0.9.0 · 5520 in / 1179 out tokens · 58467 ms · 2026-05-08T02:22:37.323207+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 8 canonical work pages

  [1] W. Zeng, M. A. Khalid, and S. Chowdhury, “In-vehicle networks outlook: Achievements and challenges,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 1552–1571, 2016.
  [2] Z. Tang et al., “WIP: Intrusion detection and localization for CAN by extracting propagation delay features from message intervals,” in 3rd USENIX Symposium on Vehicle Security and Privacy (VehicleSec 25), 2025, pp. 19–26.
  [3] A. Lotto, F. Marchiori, A. Brighente, and M. Conti, “A survey and comparative analysis of security properties of can authentication protocols,” IEEE Communications Surveys & Tutorials, 2024.
  [4] K. Serag, Z. Tang, S. Kim, V. Kumar, S. Zonouz, R. Beyah, D. Xu, Z. B. Celik et al., “Sok: Kicking can down the road. systematizing can security knowledge,” arXiv preprint arXiv:2510.02960, 2025.
  [5] B. Lampe and W. Meng, “Intrusion detection in the automotive domain: A comprehensive review,” IEEE Communications Surveys & Tutorials, vol. 25, no. 4, pp. 2356–2426, 2023.
  [6] V. Tanksale, “Intrusion detection system for controller area network,” Cybersecurity, vol. 7, no. 1, p. 4, 2024.
  [7] T.-N. Hoang and D. Kim, “Supervised contrastive resnet and transfer learning for the in-vehicle intrusion detection system,” Expert Systems with Applications, vol. 238, p. 122181, 2024.
  [8] S.-F. Lokman, A. T. Othman, and M.-H. Abu-Bakar, “Intrusion detection system for automotive controller area network (can) bus system: a review,” EURASIP Journal on Wireless Communications and Networking, vol. 2019, no. 1, pp. 1–17, 2019.
  [9] S. Jeong, S. Lee, H. Lee, and H. K. Kim, “X-canids: Signal-aware explainable intrusion detection system for controller area network-based in-vehicle network,” IEEE Transactions on Vehicular Technology, vol. 73, no. 3, pp. 3230–3246, 2023.
  [10] S. Neupane et al., “Explainable intrusion detection systems (x-ids): A survey of current methods, challenges, and opportunities,” IEEE Access, vol. 10, pp. 112392–112415, 2022.
  [11] Ö. Özdemir, M. T. İşyapar, P. Karagöz, K. W. Schmidt, D. Demir, and N. A. Karagöz, “A survey of anomaly detection in in-vehicle networks,” arXiv preprint arXiv:2409.07505, 2024.
  [12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
  [13] S. Lee, W. Choi, I. Kim, G. Lee, and D. H. Lee, “A comprehensive analysis of datasets for automotive intrusion detection systems,” Computers, Materials & Continua, vol. 76, no. 3, 2023.
  [14] M. Müter and N. Asaj, “Entropy-based anomaly detection for in-vehicle networks,” in 2011 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2011, pp. 1110–1115.
  [15] S.-F. Lokman, A. T. Othman, and M.-H. Abu-Bakar, “Intrusion detection system for automotive controller area network (can) bus system: A review,” Vehicular Communications, vol. 36, p. 100481, 2022.
  [16] M.-J. Kang et al., “Can-bert: Context-aware representation learning for can bus intrusion detection,” arXiv preprint arXiv:2210.09439, 2022. [Online]. Available: https://arxiv.org/abs/2210.09439
  [17] M. E. Verma, R. A. Bridges, M. D. Iannacone, S. C. Hollifield, P. Moriano, S. C. Hespeler, B. Kay, and F. L. Combs, “Real ornl automotive dynamometer (road) can intrusion dataset,” Zenodo, 2020, dataset. [Online]. Available: https://zenodo.org/records/10462796
  [18] Hacking and Countermeasure Research Lab, “Car-hacking dataset for intrusion detection systems,” Online dataset, 2016, accessed: 2026-02. [Online]. Available: https://ocslab.hksecurity.net/Datasets/car-hacking-dataset
  [19] J. Chen, A. Feng, Z. Zhao, J. Garza, G. Nurbek, C. Qin, A. Maatouk, L. Tassiulas, Y. Gao, and R. Ying, “Mtbench: A multimodal time series benchmark for temporal reasoning and question answering,” arXiv preprint arXiv:2503.16858, 2025.
  [20] Y. Kong, Y. Yang, Y. Hwang, W. Du, S. Zohren, Z. Wang, M. Jin, and Q. Wen, “Time-mqa: Time series multi-task question answering with context enhancement,” arXiv preprint arXiv:2503.01875, 2025.
  [21] Y. Wang et al., “Itformer: Bridging time series and natural language for multi-modal qa with large-scale multitask dataset,” arXiv preprint arXiv:2506.20093, 2025.
  [22] Z. Xie et al., “Chatts: Aligning time series with llms via synthetic data for enhanced understanding and reasoning,” arXiv preprint arXiv:2412.03104, 2024.