BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
Predicting when drivers hand over or take back control from automation requires combining cabin video, road video, vehicle signals, and route data rather than video alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BATON supplies a closed-loop multimodal record of naturalistic, bidirectional automation transitions. Its benchmark evaluations indicate that visual inputs alone are insufficient for reliable transition prediction, and that fused CAN and route-context signals supply complementary information. Takeover events also develop more gradually than handover events and benefit from longer prediction horizons.
What carries the argument
The BATON dataset's synchronized multimodal streams that form a closed-loop record around each control transition event.
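To make "synchronized streams around a transition" concrete, here is a minimal sketch of nearest-timestamp alignment on a shared clock. It is not the paper's pipeline, which this page does not describe. The stream names, sampling rates, and the `window_around_event` helper are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the timestamp closest to t (timestamps sorted ascending)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i


def window_around_event(streams, event_time, before_s, after_s, step_s):
    """Sample every stream on a shared clock spanning the event.

    streams: dict mapping a stream name to (sorted timestamps, values).
    Returns one row per clock tick, each a dict of stream name -> value.
    """
    n_ticks = int(round((before_s + after_s) / step_s)) + 1
    rows = []
    for k in range(n_ticks):
        t = event_time - before_s + k * step_s
        row = {"t": t}
        for name, (ts, values) in streams.items():
            row[name] = values[nearest_index(ts, t)]
        rows.append(row)
    return rows


# Hypothetical toy streams: a CAN speed signal at 10 Hz and video frame ids at 2 Hz.
streams = {
    "can_speed": ([i / 10 for i in range(100)], [20.0 + i * 0.1 for i in range(100)]),
    "front_frame": ([i / 2 for i in range(20)], list(range(20))),
}
rows = window_around_event(streams, event_time=5.0, before_s=1.0, after_s=1.0, step_s=0.5)
```

Nearest-neighbour lookup is the simplest choice here. A real pipeline might instead interpolate continuous signals or reject samples whose timestamp gap exceeds a tolerance.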
If this is right
- Designers can create proactive HMIs that use multimodal predictions to prepare drivers before transitions occur.
- Takeover alerts should use longer prediction windows because these events build gradually.
- Handover alerts can rely on shorter, immediate contextual cues because those events depend on sudden signals.
- Avoiding video-only systems reduces risks of over-reliance or delayed intervention in assisted driving.
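The difference between short and long prediction windows can be illustrated with a small labeling sketch. It assumes one plausible rule, that a sample is positive when a transition starts within the next `horizon_s` seconds. The paper's exact labeling rule is not given on this page, so the function and the toy timings below are hypothetical.

```python
def label_samples(sample_times, event_times, horizon_s):
    """Label each sample 1 if a transition starts within (t, t + horizon_s], else 0.

    sample_times is assumed to be sorted in ascending order.
    """
    events = sorted(event_times)
    labels = []
    j = 0
    for t in sample_times:
        # Skip events that happened at or before the current sample time.
        while j < len(events) and events[j] <= t:
            j += 1
        upcoming = j < len(events) and events[j] - t <= horizon_s
        labels.append(1 if upcoming else 0)
    return labels


# Hypothetical drive sampled once per second with a single takeover at t = 5.5 s.
sample_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
takeover_events = [5.5]
short_labels = label_samples(sample_times, takeover_events, horizon_s=1.0)
long_labels = label_samples(sample_times, takeover_events, horizon_s=3.0)
```

With the 1-second horizon only the sample at 5.0 s is positive, while the 3-second horizon also marks 3.0 s and 4.0 s. A longer horizon therefore gives a model earlier positives to learn from, which is where gradually developing events would be expected to help.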
Where Pith is reading between the lines
- The asymmetry finding could guide asymmetric alert designs that treat engagement and disengagement differently.
- The dataset structure might transfer to studying shared control in other vehicles or robotic systems.
- Testing the same multimodal fusion on data from varied weather, traffic densities, or driver demographics would check how generally the complementarity holds.
Load-bearing premise
The 127 drivers and their drives accurately represent how people typically use automation in everyday conditions.
What would settle it
A new set of drives or models in which adding CAN bus and route signals produces no accuracy gain over video-only baselines on the handover and takeover prediction tasks.
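One way such a check could be run is a paired bootstrap on the accuracy difference between the two conditions. The sketch below is an illustration under stated assumptions, not the paper's evaluation protocol: the function names are hypothetical, accuracy stands in for whatever metric the benchmark reports, and examples are resampled independently.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def paired_bootstrap_gain(labels, preds_video, preds_fused, n_resamples=2000, seed=0):
    """Return (observed gain, lower, upper) for accuracy(fused) - accuracy(video).

    The interval is a 95% percentile bootstrap over resampled examples; the same
    resampled indices are used for both models, which makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = accuracy(preds_fused, labels) - accuracy(preds_video, labels)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        acc_video = accuracy([preds_video[i] for i in idx], ys)
        acc_fused = accuracy([preds_fused[i] for i in idx], ys)
        gains.append(acc_fused - acc_video)
    gains.sort()
    lower = gains[int(0.025 * n_resamples)]
    upper = gains[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical toy predictions: the fused model is right everywhere,
# the video-only model is wrong on the first 30 of 100 examples.
labels = [1, 0] * 50
preds_fused = list(labels)
preds_video = [1 - y if i < 30 else y for i, y in enumerate(labels)]
observed, lower, upper = paired_bootstrap_gain(labels, preds_video, preds_fused)
```

An interval that includes zero would be consistent with no gain from CAN and route signals. Because events from the same driver are likely correlated, resampling whole drivers rather than individual examples would probably be the more conservative choice for data like this.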
Original abstract
Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers, and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks: driving action understanding, handover prediction, and takeover prediction, and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the BATON dataset, a large-scale multimodal collection from 127 drivers and 136.6 hours of naturalistic driving that synchronizes front-view video, in-cabin video, CAN bus signals, radar data, and GPS route context around automation control transitions. It defines three benchmark tasks—driving action understanding, handover prediction, and takeover prediction—and evaluates baselines including sequence models, classical classifiers, and zero-shot vision-language models. The key empirical findings are that visual modalities alone are insufficient for reliable transition prediction due to complementary information needs, that adding CAN and route signals substantially improves performance, and that there is an asymmetry where takeover events benefit from longer prediction horizons while handover events rely on immediate cues.
Significance. If the results hold, this work provides a valuable resource for the human-computer interaction and automated driving communities by enabling research on proactive, context-aware human-machine interfaces. The demonstrated modality complementarity and prediction asymmetry offer concrete insights for improving safety and user experience in production driving automation systems, addressing limitations in existing datasets that lack synchronized multimodal transition data.
major comments (1)
- [§5] Experiments and results: The central claims on modality complementarity and prediction asymmetry rest on the reported baseline improvements, yet the manuscript provides no quantitative metrics (e.g., F1 scores, AUC, or accuracy deltas), number of labeled transition events, or class-balance statistics; without these, it is impossible to evaluate whether the gains are statistically meaningful or robust to data characteristics.
minor comments (3)
- [§3.1] Dataset collection: The description of driver recruitment and consent procedures is high-level; adding details on inclusion criteria, demographic distribution, and any IRB approval reference would strengthen the claim of naturalistic coverage.
- [§4.2] Task definitions: The handover and takeover prediction tasks are clearly motivated, but the exact temporal windows and labeling rules for 'gradual' vs. 'immediate' events are not formalized; a precise definition (e.g., via pseudocode or decision tree) would aid reproducibility.
- [Figure 3 and Table 2] The modality-ablation results would benefit from error bars or statistical significance tests between video-only and multimodal conditions to support the 'substantially improves' claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects for strengthening the presentation of our experimental results. We address the major comment point by point below.
- Referee: §5, Experiments and results: The central claims on modality complementarity and prediction asymmetry rest on the reported baseline improvements, yet the manuscript provides no quantitative metrics (e.g., F1 scores, AUC, or accuracy deltas), number of labeled transition events, or class-balance statistics; without these, it is impossible to evaluate whether the gains are statistically meaningful or robust to data characteristics.
Authors: We agree that the current version of the manuscript does not provide the specific quantitative metrics (F1 scores, AUC, accuracy deltas), the exact count of labeled transition events, or class-balance statistics in §5. This limits the ability to fully assess the statistical significance and robustness of the modality complementarity and prediction asymmetry findings. In the revised manuscript, we will add a detailed results table in §5 reporting these metrics for all baselines (sequence models, classical classifiers, and zero-shot VLMs) across modality ablations. We will also include the total number of handover and takeover events identified in the 136.6 hours of data from 127 drivers, along with class distribution statistics and any relevant significance testing. These additions will directly support evaluation of the reported improvements.
Revision: yes
Circularity Check
No significant circularity
This is an empirical dataset collection and benchmarking paper with no mathematical derivations, fitted parameters, or self-referential predictions. It introduces the BATON dataset from 127 drivers, defines three tasks (driving action understanding, handover prediction, takeover prediction), synchronizes multimodal streams (front-view video, in-cabin video, CAN, radar, GPS), and reports baseline comparisons across sequence models, classifiers, and VLMs. The central claims about visual insufficiency, CAN/route complementarity, and handover/takeover horizon asymmetry rest directly on these empirical results; they do not reduce to their inputs by construction, depend on load-bearing self-citation, or smuggle in an ansatz. No equations or uniqueness theorems appear, and the work is self-contained against external benchmarks.