Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Alejandra Vidrio; Angel Martinez-Sanchez; Erika Maquiling; Giovanni Tapia Lopez; Jake Rattigan; Laura Fleig; Maitrayee Keskar; Mira Sur; Mohan Trivedi; Parthib Roy

arXiv: 2602.07668 · v2 · submitted 2026-02-07 · cs.CV · cs.AI · cs.LG · cs.RO

Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez, Angel Martinez-Sanchez, Parthib Roy, Jake Rattigan, Mira Sur, Alejandra Vidrio, Thomas Marcotte, Mohan Trivedi

Pith reviewed 2026-05-16 05:53 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG · cs.RO

keywords: multimodal fusion · driver state assessment · audio sensing · vehicle safety · L-LIO framework · intelligent vehicles · sensor fusion · impairment detection

The pith

Adding audio signals to visual sensing forms the L-LIO framework that improves driver state assessment and vehicle environment understanding through multimodal fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the existing LILO visual framework by incorporating audio inputs to create the L-LIO system. This multimodal approach targets better evaluation of driver conditions such as impairment and clearer understanding of surrounding scenes. The authors present three pilot cases using custom real-world audio datasets: classifying driver speech for intoxication or similar states, processing passenger spoken instructions for vehicle planning, and using sound to clarify external agents' gestures when vision alone is ambiguous. They report that audio supplies useful safety information in nuanced situations where visual data falls short. The work notes ongoing challenges with noise, privacy, and consistency across people but positions the fused sensing as a route to stronger safety interventions.

Core claim

The central claim is that expanding the looking-in-looking-out framework with audio signals produces the looking-and-listening inside-and-outside framework, which strengthens driver state assessment and environment understanding via multimodal sensor fusion, as shown in pilot evaluations of speech-based impairment classification, natural-language passenger instructions, and audio disambiguation of external guidance.

What carries the argument

The L-LIO framework, which fuses audio and visual signals collected inside and outside the vehicle to support safety-relevant decisions.

If this is right

Supervised models trained on driver speech audio can classify states such as intoxication.
Passenger spoken instructions can be collected and aligned to guide vehicle planning systems.
Audio cues can resolve ambiguities in external agents' gestures and guidance that vision-only systems miss.
Multimodal fusion of audio and visual data opens new paths for safety interventions in intelligent vehicles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to passenger monitoring and external pedestrian interaction in shared autonomy settings.
Systems built on L-LIO may require separate privacy-preserving audio processing pipelines to reach widespread use.
Real-time fusion algorithms would need testing against varying cabin and road noise profiles before deployment.
The framework suggests a general pattern for adding sound-based channels to other vision-centric vehicle perception tasks.
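
One way to probe the noise point above is to mix recorded cabin or road noise into clean speech at controlled signal-to-noise ratios before running any classifier. The sketch below is a minimal illustration and is not taken from the paper: the helper name `mix_at_snr`, the synthetic sine wave standing in for speech, and the Gaussian samples standing in for cabin noise are all assumptions made for the example.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db.

    Assumes noise is at least as long as speech (hypothetical helper).
    """
    noise = noise[: len(speech)]  # use only the part of the noise that overlaps
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Solve p_speech / (scale**2 * p_noise) = 10 ** (snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins (assumptions): a 220 Hz tone for speech, Gaussian noise for the cabin.
random.seed(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate)]

# Sweep from quiet (20 dB) to very noisy (-5 dB) conditions.
noisy_versions = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0, -5)}
```

Evaluating the same model on each entry of `noisy_versions` would give an accuracy-versus-SNR curve, which is one plausible form the suggested noise testing could take.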

Load-bearing premise

Audio signals can supply reliable safety insights in real-world noisy conditions without major interference, privacy conflicts, or loss of performance across different people.

What would settle it

A controlled test showing that audio-based classification of driver impairment or external guidance performs at chance level in typical driving noise levels or across varied speakers.
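
The "chance level" criterion can be made concrete with an exact one-sided binomial test on held-out predictions. The sketch below assumes a balanced two-class task (chance = 0.5); the function name `binomial_p_value` and all counts are hypothetical illustrations, not results from the paper.

```python
from math import comb


def binomial_p_value(correct, total, chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) if the model were guessing."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts (assumptions): 100 test clips per condition.
p_quiet = binomial_p_value(correct=80, total=100)  # quiet cabin
p_noisy = binomial_p_value(correct=53, total=100)  # typical driving noise

# A small p-value means accuracy is unlikely under guessing;
# a large one means the result is indistinguishable from chance.
print(f"quiet: p = {p_quiet:.2e}, noisy: p = {p_noisy:.2f}")
```

Under these made-up counts the quiet condition is clearly above chance while the noisy one is not, which is the pattern that would undercut the load-bearing premise.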

Original abstract

The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The L-LIO idea adds audio to the existing LILO framework in sensible places, but the paper gives only qualitative pilot descriptions with no numbers or baselines to show any actual gain.

The paper's main contribution is extending the authors' prior LILO work by folding in audio signals to create the L-LIO framework. They outline three concrete cases where sound could help: classifying driver speech for impairment states like intoxication, capturing passenger spoken instructions to feed into planning, and using audio to resolve ambiguous gestures or guidance from people outside the vehicle. These scenarios are reasonable extensions of vision-only limits, and the write-up does a clear job naming where audio might supply context that cameras miss. It also notes real deployment hurdles such as ambient noise, privacy, and cross-subject robustness without pretending they are solved. That part shows honest engagement with the problem space.

The weakness is that none of the claims are backed by data. The description mentions custom in-vehicle and external audio samples plus pilot findings, yet supplies no classification accuracies, instruction success rates, disambiguation metrics, dataset sizes, model architectures, or direct comparisons against vision-only baselines. Without those, the assertion that audio yields safety-relevant insights stays at the level of assertion rather than evidence.

The work is aimed at researchers already tracking multimodal sensing for driver monitoring and autonomous vehicle decision systems. Someone looking for implemented methods or reproducible results will find little to use directly. I would send it for peer review so the authors can add the missing quantitative evaluations and controls; the underlying direction is worth testing even if the current version is thin on support.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes extending the looking-in-looking-out (LILO) framework to a looking-and-listening inside-and-outside (L-LIO) framework by adding audio signals for multimodal fusion. It claims this enhances driver state assessment (e.g., speech impairment classification), passenger natural language instructions for planning, and disambiguation of external agents, supported by qualitative pilot findings on custom in-vehicle and external audio datasets collected in real-world environments.

Significance. If the pilot cases were supported by quantitative metrics and baselines, L-LIO could meaningfully advance multimodal safety systems in intelligent vehicles by addressing vision-only limitations in context-rich scenarios. As presented, the conceptual framing is clear but the absence of verifiable results limits its contribution to the literature on driver monitoring and autonomous decision-making.

major comments (2)

[Evaluation of example cases] The three example cases in the evaluation section are described only qualitatively (driver speech impairment, passenger instructions such as 'turn after that red building', and audio disambiguation of external agents) with no reported accuracies, F1 scores, success rates, dataset sizes, collection protocols, model architectures, or comparisons to vision-only baselines. This directly undermines the central claim that audio 'yields safety-relevant insights' and 'enhances' assessment.
[Challenges and future work] The abstract and challenges paragraph mention ambient noise interference, privacy considerations, and robustness across subjects as open issues, yet no experiments, noise-handling methods, or subject-variability tests are provided to assess whether audio can reliably augment safety in dynamic real-world conditions.
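
To illustrate the kind of reporting the first major comment asks for, the sketch below computes accuracy and F1 for a vision-only baseline and an audio-visual system on the same test labels. It is a generic example: the function `binary_metrics` and the toy label lists are assumptions, and none of the numbers come from the manuscript.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = impaired, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (assumptions): ground truth and two systems' predictions on 8 clips.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
vision_only = [1, 0, 0, 0, 0, 1, 1, 0]
audio_visual = [1, 1, 0, 0, 0, 0, 1, 0]

for name, preds in [("vision-only", vision_only), ("audio-visual", audio_visual)]:
    m = binary_metrics(y_true, preds)
    print(f"{name}: accuracy={m['accuracy']:.3f}, F1={m['f1']:.3f}")
```

A table of such paired metrics, computed on a shared test split with stated dataset sizes, would let readers check whether audio adds anything over the visual baseline.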

minor comments (2)

[Abstract] The abstract states 'pilot findings show that audio yields safety-relevant insights' without enumerating what those specific findings are or referencing any supporting table or figure.
[Framework description] Notation for the proposed L-LIO framework is introduced at a high level but without a diagram or pseudocode clarifying the multimodal fusion architecture relative to the original LILO.
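
As a point of reference for the last minor comment, one common and simple fusion pattern is late fusion: each modality produces class probabilities, and the system takes a weighted average. The sketch below is a generic illustration under that assumption and is not the architecture described in the manuscript; `late_fusion`, the `audio_weight` parameter, and the example probabilities are all hypothetical.

```python
def late_fusion(p_vision, p_audio, audio_weight=0.5):
    """Weighted average of per-class probabilities from vision and audio.

    If one modality is unavailable (None), fall back to the other.
    """
    if p_audio is None:
        return dict(p_vision)
    if p_vision is None:
        return dict(p_audio)
    labels = set(p_vision) | set(p_audio)
    fused = {
        label: (1 - audio_weight) * p_vision.get(label, 0.0)
        + audio_weight * p_audio.get(label, 0.0)
        for label in labels
    }
    total = sum(fused.values())
    return {label: value / total for label, value in fused.items()}


# Hypothetical external-agent case: the gesture is visually ambiguous,
# but a shouted instruction makes the audio model confident.
p_vision = {"stop": 0.5, "proceed": 0.5}
p_audio = {"stop": 0.9, "proceed": 0.1}

fused = late_fusion(p_vision, p_audio)
decision = max(fused, key=fused.get)
print(fused, decision)
```

A diagram or pseudocode at roughly this level of detail would show readers where audio enters the pipeline relative to the original LILO design.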

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript proposing the L-LIO framework. We address each major comment below and outline the revisions we will make to better scope our claims and clarify the preliminary nature of the work.

Referee: [Evaluation of example cases] The three example cases in the evaluation section are described only qualitatively (driver speech impairment, passenger instructions such as 'turn after that red building', and audio disambiguation of external agents) with no reported accuracies, F1 scores, success rates, dataset sizes, collection protocols, model architectures, or comparisons to vision-only baselines. This directly undermines the central claim that audio 'yields safety-relevant insights' and 'enhances' assessment.

Authors: We appreciate this observation and agree that the evaluations are qualitative. The manuscript is positioned as a conceptual proposal for extending LILO to L-LIO, using three illustrative pilot cases collected in real-world environments to motivate the framework rather than to deliver a full empirical study with benchmarks. Dataset sizes and collection protocols for the custom in-vehicle and external audio samples are described in the text, but no quantitative metrics or vision-only baselines were computed. We will revise the evaluation section to explicitly frame these as preliminary qualitative examples, tone down claims of enhancement to 'potential' insights, and state that quantitative comparisons are reserved for future work. This will align the presentation with the evidence provided.

revision: partial

Referee: [Challenges and future work] The abstract and challenges paragraph mention ambient noise interference, privacy considerations, and robustness across subjects as open issues, yet no experiments, noise-handling methods, or subject-variability tests are provided to assess whether audio can reliably augment safety in dynamic real-world conditions.

Authors: We agree that the challenges are presented without accompanying experiments or methods. As the manuscript focuses on introducing the multimodal framework and motivating its use via pilots, detailed validation of noise robustness or subject variability falls outside the current scope. We will revise the abstract and challenges paragraph to more clearly label these as open issues for future research and briefly outline example directions, such as adaptive filtering for noise or multi-subject data collection protocols, without claiming any current solutions.

revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual framework proposal without derivations or self-referential reductions

The manuscript proposes the L-LIO framework as a multimodal extension of the existing LILO framework by adding audio signals for driver state assessment and scene understanding. No equations, parameters, or quantitative predictions appear anywhere in the text. The three pilot cases (speech impairment classification, passenger instructions, and external agent disambiguation) are described qualitatively with no fitted models, no performance metrics, and no claims that a derived quantity equals an input by construction. LILO is referenced as prior work but is not used to justify any uniqueness theorem or ansatz within this paper; the central claim remains a high-level suggestion for sensor fusion rather than a derived result. The absence of any load-bearing mathematical or predictive step means the derivation chain is empty and the proposal is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on domain assumptions about the utility of audio in vehicle contexts, with no free parameters or formal derivations; the L-LIO framework itself is the main invented entity.

axioms (1)

Domain assumption: Audio signals provide safety-relevant information not captured by vision alone in driver and scene understanding tasks.
Invoked throughout the proposal of L-LIO and the three example cases in the abstract.

invented entities (1)

L-LIO framework (no independent evidence)
purpose: To integrate audio sensing with visual data for enhanced multimodal driver state assessment and environment understanding.
Newly proposed augmentation to the prior LILO framework.

pith-pipeline@v0.9.0 · 5647 in / 1260 out tokens · 68388 ms · 2026-05-16T05:53:54.738441+00:00

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

[1]

Opensmile: the munich versatile and fast open-source audio feature extractor,

F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462

work page 2010
[2]

The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015

work page 2015
[3]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

work page 2020
[4]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[5]

Murphy-Chutorian, E., & Trivedi, M. M. (2008). Head pose estimation in computer vision: A survey. IEEE transactions on pattern analysis and machine intelligence, 31(4), 607-626

work page 2008
[6]

Vicente, F., Huang, Z., Xiong, X., De la Torre, F., Zhang, W., & Levi, D. (2015). Driver gaze tracking and eyes off the road detection system. IEEE Transactions on Intelligent Transportation Systems, 16(4), 2014-2027

work page 2015
[7]

R., & Strayer, D

Lohani, M., Payne, B. R., & Strayer, D. L. (2019). A review of psychophysiological measures to assess cognitive states in real-world driving. Frontiers in human neuroscience, 13, 57

work page 2019
[8]

U., Awan, S

Ramzan, M., Khan, H. U., Awan, S. M., Ismail, A., Ilyas, M., & Mahmood, A. (2019). A survey on state-of-the-art drowsiness detection techniques. IEEE Access, 7, 61904-61919

work page 2019
[9]

Kashevnik, A., Shchedrin, R., Kaiser, C., & Stocker, A. (2021). Driver distraction detection methods: A literature review and framework. IEEE Access, 9, 60063-60076

work page 2021
[10]

Paxion, J., Galy, E., & Berthelon, C. (2014). Mental workload and driving. Frontiers in psychology, 5, 1344

work page 2014
[11]

Deo, N., & Trivedi, M. M. (2019). Looking at the driver/rider in autonomous vehicles to predict take-over readiness. IEEE Transactions on Intelligent Vehicles, 5(1), 41-52. Greer 17

work page 2019
[12]

Greer, R., Deo, N., Rangesh, A., Gunaratne, P., & Trivedi, M. (2023). Safe control transitions: Machine vision based observable readiness index and data-driven takeover time prediction. International Technical Symposium on the Enhanced Safety of Vehicles (ESV)

work page 2023
[13]

doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision–Language Navigation,

Roy, P., Perisetla, S., Shriram, S., Krishnaswamy, H., Keskar, A., and Greer, R., “doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision–Language Navigation,” arXiv preprint arXiv:2412.05893 , 2024

work page arXiv 2024
[14]

nuScenes: A multimodal dataset for autonomous driving,

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[15]

S., Ali, S

Al-Quraishi, M. S., Ali, S. S. A., Muhammad, A. Q., Tang, T. B., & Elferik, S. (2024). Technologies for detecting and monitoring drivers' states: A systematic review. Heliyon, 10(20)

work page 2024
[16]

Sahayadhas, A., Sundaraj, K., & Murugappan, M. (2012). Detecting driver drowsiness based on sensors: a review. Sensors, 12(12), 16937-16953

work page 2012
[17]

Nair, A., Patil, V., Nair, R., Shetty, A., & Cherian, M. (2024). A review on recent driver safety systems and its emerging solutions. International Journal of Computers and Applications, 46(3), 137-151

work page 2024
[18]

(2017, February)

Zhang, F., Su, J., Geng, L., & Xiao, Z. (2017, February). Driver fatigue detection based on eye state recognition. In 2017 International Conference on Machine Vision and Information Technology (CMVIT) (pp. 105-110). IEEE

work page 2017
[19]

H., Pourshahabi, M

Sigari, M. H., Pourshahabi, M. R., Soryani, M., & Fathy, M. (2014). A review on driver face monitoring systems for fatigue and distraction detection. International Journal of Advanced Science and Technology, 64, 73-100

work page 2014
[20]

Suffoletto, B., Anwar, A., Glaister, S., & Sejdic, E. (2023). Detection of alcohol intoxication using voice features: a controlled laboratory study. Journal of studies on alcohol and drugs, 84(6), 808-813

work page 2023
[21]

(2008, March)

Yin, B., Chen, F., Ruiz, N., & Ambikairajah, E. (2008, March). Speech-based cognitive load monitoring system. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 2041-2044). IEEE

work page 2008
[22]

Van Puyvelde, M., Neyt, X., McGlone, F., & Pattyn, N. (2018). Voice stress analysis: A new framework for voice and effort in human performance. Frontiers in psychology, 9, 1994

work page 2018
[23]

Amato, F., Cesarini, V., Olmo, G., Saggio, G., & Costantini, G. (2025). Beyond breathalyzers: AI-powered speech analysis for alcohol intoxication detection. Expert Systems with Applications, 262, 125656

work page 2025
[24]

H., Grantham, D

Ashmead, D. H., Grantham, D. W., Maloff, E. S., Hornsby, B., Nakamura, T., Davis, T. J., ... & Rushing, E. G. (2012). Auditory perception of motor vehicle travel paths. Human factors, 54(3), 437-453

work page 2012
[25]

Sun, H., Liu, X., Xu, K., Miao, J., & Luo, Q. (2021). Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797

work page arXiv 2021
[26]

O., Rashid, R

Albaji, A. O., Rashid, R. B. A., & Abdul Hamid, S. Z. (2023). Investigation on machine learning approaches for environmental noise classifications. Journal of Electrical and Computer Engineering, 2023(1), 3615137

work page 2023
[27]

S., Amaya Guzmán, B., Aya-Parra, P

Remolina Soto, M. S., Amaya Guzmán, B., Aya-Parra, P. A., Perdomo, O. J., Becerra-Fernandez, M., & Sarmiento-Rojas, J. (2025). Intelligent Classification of Urban Noise Sources Using TinyML: Towards Efficient Noise Management in Smart Cities. Sensors, 25(20), 6361

work page 2025
[28]

Liang, R., Liu, W., Li, W., & Wu, Z. (2022). A traffic noise source identification method for buildings adjacent to multiple transport infrastructures based on deep learning. Building and Environment, 211, 108764

work page 2022
[29]

E., Heck, L., Peters, S., & Hansen, J

Weng, F., Angkititrakul, P., Shriberg, E. E., Heck, L., Peters, S., & Hansen, J. H. (2016). Conversational in-vehicle dialog systems: The past, present, and future. IEEE Signal Processing Magazine, 33(6), 49-60

work page 2016
[30]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Hu, T., Liu, X., Wang, S., Zhu, Y., Liang, A., Kong, L., ... & Liang, J. (2025). Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760

work page arXiv 2025
[31]

& Van Den Hengel, A

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & Van Den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real Greer 18 environments. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3674-3683)

work page 2018
[32]

Zhu, Y., Wang, S., Zhong, W., Shen, N., Li, Y., Wang, S., ... & Li, L. (2025). A Survey on Large Language Model-Powered Autonomous Driving. Engineering

work page 2025
[33]

& Ran, B

You, J., Jiang, Z., Huang, Z., Shi, H., Gan, R., Wu, K., ... & Ran, B. (2026). V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models. Transportation Research Part C: Emerging Technologies, 183, 105457

work page 2026
[34]

Huang, Z., Sheng, Z., Qu, Y., You, J., & Chen, S. (2025). Vlm-rl: A unified vision language models and reinforcement learning framework for safe autonomous driving. Transportation Research Part C: Emerging Technologies, 180, 105321

work page 2025
[35]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S.,

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PmLR

work page 2021
[36]

(2025, September)

Gao, H., Wang, Z., Li, Y., Long, K., Yang, M., & Shen, Y. (2025, September). A survey for foundation models in autonomous driving. In 2025 6th International Conference on Computer Vision and Data Mining (ICCVDM) (pp. 63-71). IEEE

work page 2025
[37]

Sathyam, R., & Li, Y. (2025). Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities. IEEE Open Journal of Vehicular Technology

work page 2025
[38]

M., Shannon, T., & Tippelhofer, M

Tawari, A., Sivaraman, S., Trivedi, M. M., Shannon, T., & Tippelhofer, M. (2014, June). Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking. In 2014 IEEE Intelligent Vehicles Symposium Proceedings (pp. 115-120). IEEE

work page 2014
[39]

Rangesh, A., Deo, N., Greer, R., Gunaratne, P., & Trivedi, M. M. (2021, September). Autonomous vehicles that alert humans to take-over controls: Modeling with real-world data. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) (pp. 231-236). IEEE

work page 2021
[40]

(2002, June)

Owechko, Y., Srinivasa, N., Medasani, S., & Boscolo, R. (2002, June). Vision-based fusion system for smart airbag applications. In Intelligent Vehicle Symposium, 2002. IEEE (Vol. 1, pp. 245-250). IEEE

work page 2002
[41]

Ho, C., & Spence, C. (2005). Assessing the effectiveness of various auditory cues in capturing a driver's visual attention. Journal of experimental psychology: Applied, 11(3), 157

work page 2005
[42]

E., & Lee, J

Wiese, E. E., & Lee, J. D. (2004). Auditory alerts for in-vehicle information systems: The effects of temporal conflict and sound parameters on driver attitudes and performance. Ergonomics, 47(9), 965-986

work page 2004
[43]

Gray, R. (2011). Looming auditory collision warnings for driving. Human factors, 53(1), 63-74. 45. Li, X., & Xu, Z. (2024, November). The Impact of Auditory Warning Types and Emergency Obstacle Avoidance Takeover Scenarios on Takeover Behavior. In Proceedings of the 26th International Conference on Multimodal Interaction (pp. 134-143)

work page 2011
[44]

& Anguelov, D

Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., ... & Anguelov, D. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2446-2454)

work page 2020
[45]

H., Vora, S., Liong, V

Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., ... & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621-11631)

work page 2020
[46]

F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A.,

Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., ... & Hays, J. (2019). Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8748-8757)

work page 2019
[47]

Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A. S., ... & Schuberth, P. (2020). A2d2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320

work page arXiv 2020
[48]

& Knoll, A

Liu, M., Yurtsever, E., Fossaert, J., Zhou, X., Zimmer, W., Cui, Y., ... & Knoll, A. C. (2024). A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook. IEEE Transactions on Intelligent Vehicles. Greer 19

work page 2024
[49]

Greer, R., Antoniussen, B., Møgelmose, A., & Trivedi, M. (2025). Language-driven active learning for diverse open-set 3d object detection. In Proceedings of the Winter Conference on Applications of Computer Vision (pp. 980-988)

work page 2025
[50]

Keskar, A., Perisetla, S., & Greer, R. (2025). Evaluating multimodal vision-language model prompting strategies for visual question answering in road scene understanding. In Proceedings of the Winter Conference on Applications of Computer Vision (pp. 1027-1036)

work page 2025
[51]

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Gopalkrishnan, A., Greer, R., & Trivedi, M. Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving. In First Vision and Language for Autonomous Driving and Robotics Workshop at CVPR 2024

work page 2024
[52]

Shriram, S., Perisetla, S., Keskar, A., Krishnaswamy, H., Bossen, T. E. W., Møgelmose, A., & Greer, R. (2025). Towards a multi-agent vision-language system for zero-shot novel hazardous object detection for autonomous driving safety. IEEE RAS Conference on Automation Science and Engineering, 2025

work page 2025
[53]

(2024, October)

Choi, L., & Greer, R. (2024, October). Evaluating vision-language models for zero-shot detection, classification, and association of motorcycles, passengers, and helmets. In 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall) (pp. 1-7). IEEE

work page 2024
[54]

Greer, R., & Trivedi, M. (2024). Towards explainable, safe autonomous driving with language embeddings for novelty identification and active learning: Framework and experimental analysis with real-world data sets. arXiv preprint arXiv:2402.07320

work page arXiv 2024
[55]

and Kim, D

Park, S., Lee, M., Kang, J., Choi, H., Park, Y., Cho, J., Lee, A. and Kim, D. (2024). Vlaad: Vision and language assistant for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 980-987)

work page 2024
[56]

and Yamamoto, I

Arai, H., Miwa, K., Sasaki, K., Watanabe, K., Yamaguchi, Y., Aoki, S. and Yamamoto, I. (2025). Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 1933-1943). IEEE

work page 2025
[57]

and Houben, S

Fingscheidt, T., Gottschalk, H. and Houben, S. (2022). Deep neural networks and data for automated driving: Robustness, uncertainty quantification, and insights towards safety (p. 427). Springer Nature

work page 2022
[58]

and Greer, R

Dubnov, S. and Greer, R. (2023). Deep and shallow: Machine learning in music and audio. Chapman and Hall/CRC

work page 2023
[59]

Ittichaichareon, C., Suksri, S., & Yingthawornsuk, T. (2012). Speech recognition using MFCC. In International conference on computer graphics, simulation and modeling (Vol. 9, p. 2012)

work page 2012
[60]

and Van Leeuwen, D.A

Vaessen, N. and Van Leeuwen, D.A. (2022). Fine-tuning wav2vec2 for speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7967-7971). IEEE

work page 2022
[61]

and Strik, H

Xue, W., Cucchiarini, C., van Hout, R.W.N.M. and Strik, H. (2019). Acoustic correlates of speech intelligibility. the usability of the egemaps feature set for atypical speech

work page 2019
[62]

Martinez-Sanchez, A., Roy, P., Greer, R. (2026). Natural Language Instructions for Scene-Responsive Human-In-The-Loop Motion Planning in Autonomous Driving Using Vision-Language-Action Models. arXiv preprint

work page 2026
[63]

and Greer, R

Bossen, T.E., Møgelmose, A. and Greer, R. (2025). Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 4779-4788). Greer 20

work page 2025


work page 2019

[8] [8]

U., Awan, S

Ramzan, M., Khan, H. U., Awan, S. M., Ismail, A., Ilyas, M., & Mahmood, A. (2019). A survey on state-of-the-art drowsiness detection techniques. IEEE Access, 7, 61904-61919

work page 2019

[9] [9]

Kashevnik, A., Shchedrin, R., Kaiser, C., & Stocker, A. (2021). Driver distraction detection methods: A literature review and framework. IEEE Access, 9, 60063-60076

work page 2021

[10] [10]

Paxion, J., Galy, E., & Berthelon, C. (2014). Mental workload and driving. Frontiers in psychology, 5, 1344

work page 2014

[11] [11]

Deo, N., & Trivedi, M. M. (2019). Looking at the driver/rider in autonomous vehicles to predict take-over readiness. IEEE Transactions on Intelligent Vehicles, 5(1), 41-52. Greer 17

work page 2019

[12] [12]

Greer, R., Deo, N., Rangesh, A., Gunaratne, P., & Trivedi, M. (2023). Safe control transitions: Machine vision based observable readiness index and data-driven takeover time prediction. International Technical Symposium on the Enhanced Safety of Vehicles (ESV)

work page 2023

[13] [13]

doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision–Language Navigation,

Roy, P., Perisetla, S., Shriram, S., Krishnaswamy, H., Keskar, A., and Greer, R., “doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision–Language Navigation,” arXiv preprint arXiv:2412.05893 , 2024

work page arXiv 2024

[14] [14]

nuScenes: A multimodal dataset for autonomous driving,

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020

[15] [15]

S., Ali, S

Al-Quraishi, M. S., Ali, S. S. A., Muhammad, A. Q., Tang, T. B., & Elferik, S. (2024). Technologies for detecting and monitoring drivers' states: A systematic review. Heliyon, 10(20)

work page 2024

[16] [16]

Sahayadhas, A., Sundaraj, K., & Murugappan, M. (2012). Detecting driver drowsiness based on sensors: a review. Sensors, 12(12), 16937-16953

work page 2012

[17] [17]

Nair, A., Patil, V., Nair, R., Shetty, A., & Cherian, M. (2024). A review on recent driver safety systems and its emerging solutions. International Journal of Computers and Applications, 46(3), 137-151

work page 2024

[18] [18]

(2017, February)

Zhang, F., Su, J., Geng, L., & Xiao, Z. (2017, February). Driver fatigue detection based on eye state recognition. In 2017 International Conference on Machine Vision and Information Technology (CMVIT) (pp. 105-110). IEEE

work page 2017

[19] [19]

H., Pourshahabi, M

Sigari, M. H., Pourshahabi, M. R., Soryani, M., & Fathy, M. (2014). A review on driver face monitoring systems for fatigue and distraction detection. International Journal of Advanced Science and Technology, 64, 73-100

work page 2014

[20] [20]

Suffoletto, B., Anwar, A., Glaister, S., & Sejdic, E. (2023). Detection of alcohol intoxication using voice features: a controlled laboratory study. Journal of studies on alcohol and drugs, 84(6), 808-813

work page 2023

[21] [21]

(2008, March)

Yin, B., Chen, F., Ruiz, N., & Ambikairajah, E. (2008, March). Speech-based cognitive load monitoring system. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 2041-2044). IEEE

work page 2008

[22] [22]

Van Puyvelde, M., Neyt, X., McGlone, F., & Pattyn, N. (2018). Voice stress analysis: A new framework for voice and effort in human performance. Frontiers in psychology, 9, 1994

work page 2018

[23] [23]

Amato, F., Cesarini, V., Olmo, G., Saggio, G., & Costantini, G. (2025). Beyond breathalyzers: AI-powered speech analysis for alcohol intoxication detection. Expert Systems with Applications, 262, 125656

work page 2025

[24] [24]

H., Grantham, D

Ashmead, D. H., Grantham, D. W., Maloff, E. S., Hornsby, B., Nakamura, T., Davis, T. J., ... & Rushing, E. G. (2012). Auditory perception of motor vehicle travel paths. Human factors, 54(3), 437-453

work page 2012

[25] [25]

Sun, H., Liu, X., Xu, K., Miao, J., & Luo, Q. (2021). Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797

work page arXiv 2021

[26] [26]

O., Rashid, R

Albaji, A. O., Rashid, R. B. A., & Abdul Hamid, S. Z. (2023). Investigation on machine learning approaches for environmental noise classifications. Journal of Electrical and Computer Engineering, 2023(1), 3615137

work page 2023

[27] [27]

S., Amaya Guzmán, B., Aya-Parra, P

Remolina Soto, M. S., Amaya Guzmán, B., Aya-Parra, P. A., Perdomo, O. J., Becerra-Fernandez, M., & Sarmiento-Rojas, J. (2025). Intelligent Classification of Urban Noise Sources Using TinyML: Towards Efficient Noise Management in Smart Cities. Sensors, 25(20), 6361

work page 2025

[28] [28]

Liang, R., Liu, W., Li, W., & Wu, Z. (2022). A traffic noise source identification method for buildings adjacent to multiple transport infrastructures based on deep learning. Building and Environment, 211, 108764

work page 2022

[29] [29]

E., Heck, L., Peters, S., & Hansen, J

Weng, F., Angkititrakul, P., Shriberg, E. E., Heck, L., Peters, S., & Hansen, J. H. (2016). Conversational in-vehicle dialog systems: The past, present, and future. IEEE Signal Processing Magazine, 33(6), 49-60

work page 2016

[30] [30]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Hu, T., Liu, X., Wang, S., Zhu, Y., Liang, A., Kong, L., ... & Liang, J. (2025). Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760

work page arXiv 2025

[31] [31]

& Van Den Hengel, A

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & Van Den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real Greer 18 environments. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3674-3683)

work page 2018

[32] [32]

Zhu, Y., Wang, S., Zhong, W., Shen, N., Li, Y., Wang, S., ... & Li, L. (2025). A Survey on Large Language Model-Powered Autonomous Driving. Engineering

work page 2025

[33] [33]

& Ran, B

You, J., Jiang, Z., Huang, Z., Shi, H., Gan, R., Wu, K., ... & Ran, B. (2026). V2x-vlm: End-to-end v2x cooperative autonomous driving through large vision-language models. Transportation Research Part C: Emerging Technologies, 183, 105457

work page 2026

[34] [34]

Huang, Z., Sheng, Z., Qu, Y., You, J., & Chen, S. (2025). Vlm-rl: A unified vision language models and reinforcement learning framework for safe autonomous driving. Transportation Research Part C: Emerging Technologies, 180, 105321

work page 2025

[35] [35]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S.,

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PmLR

work page 2021

[36] [36]

(2025, September)

Gao, H., Wang, Z., Li, Y., Long, K., Yang, M., & Shen, Y. (2025, September). A survey for foundation models in autonomous driving. In 2025 6th International Conference on Computer Vision and Data Mining (ICCVDM) (pp. 63-71). IEEE

work page 2025

[37] [37]

Sathyam, R., & Li, Y. (2025). Foundation Models for Autonomous Driving Perception: A Survey Through Core Capabilities. IEEE Open Journal of Vehicular Technology

work page 2025

[38] [38]

M., Shannon, T., & Tippelhofer, M

Tawari, A., Sivaraman, S., Trivedi, M. M., Shannon, T., & Tippelhofer, M. (2014, June). Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking. In 2014 IEEE Intelligent Vehicles Symposium Proceedings (pp. 115-120). IEEE

work page 2014

[39] [39]

Rangesh, A., Deo, N., Greer, R., Gunaratne, P., & Trivedi, M. M. (2021, September). Autonomous vehicles that alert humans to take-over controls: Modeling with real-world data. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) (pp. 231-236). IEEE

work page 2021

[40] [40]

(2002, June)

Owechko, Y., Srinivasa, N., Medasani, S., & Boscolo, R. (2002, June). Vision-based fusion system for smart airbag applications. In Intelligent Vehicle Symposium, 2002. IEEE (Vol. 1, pp. 245-250). IEEE

work page 2002

[41] [41]

Ho, C., & Spence, C. (2005). Assessing the effectiveness of various auditory cues in capturing a driver's visual attention. Journal of experimental psychology: Applied, 11(3), 157

work page 2005

[42] [42]

E., & Lee, J

Wiese, E. E., & Lee, J. D. (2004). Auditory alerts for in-vehicle information systems: The effects of temporal conflict and sound parameters on driver attitudes and performance. Ergonomics, 47(9), 965-986

work page 2004

[43] [43]

Gray, R. (2011). Looming auditory collision warnings for driving. Human factors, 53(1), 63-74. 45. Li, X., & Xu, Z. (2024, November). The Impact of Auditory Warning Types and Emergency Obstacle Avoidance Takeover Scenarios on Takeover Behavior. In Proceedings of the 26th International Conference on Multimodal Interaction (pp. 134-143)

work page 2011

[44] [44]

& Anguelov, D

Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., ... & Anguelov, D. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2446-2454)

work page 2020

[45] [45]

H., Vora, S., Liong, V

Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., ... & Beijbom, O. (2020). nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621-11631)

work page 2020

[46] [46]

F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A.,

Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., ... & Hays, J. (2019). Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8748-8757)

work page 2019

[47] [47]

Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A. S., ... & Schuberth, P. (2020). A2d2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320

work page arXiv 2020

[48] [48]

& Knoll, A

Liu, M., Yurtsever, E., Fossaert, J., Zhou, X., Zimmer, W., Cui, Y., ... & Knoll, A. C. (2024). A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook. IEEE Transactions on Intelligent Vehicles. Greer 19

work page 2024

[49] [49]

Greer, R., Antoniussen, B., Møgelmose, A., & Trivedi, M. (2025). Language-driven active learning for diverse open-set 3d object detection. In Proceedings of the Winter Conference on Applications of Computer Vision (pp. 980-988)

work page 2025

[50] [50]

Keskar, A., Perisetla, S., & Greer, R. (2025). Evaluating multimodal vision-language model prompting strategies for visual question answering in road scene understanding. In Proceedings of the Winter Conference on Applications of Computer Vision (pp. 1027-1036)

work page 2025

[51] [51]

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Gopalkrishnan, A., Greer, R., & Trivedi, M. Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving. In First Vision and Language for Autonomous Driving and Robotics Workshop at CVPR 2024

work page 2024

[52] [52]

Shriram, S., Perisetla, S., Keskar, A., Krishnaswamy, H., Bossen, T. E. W., Møgelmose, A., & Greer, R. (2025). Towards a multi-agent vision-language system for zero-shot novel hazardous object detection for autonomous driving safety. IEEE RAS Conference on Automation Science and Engineering, 2025

work page 2025

[53] [53]

(2024, October)

Choi, L., & Greer, R. (2024, October). Evaluating vision-language models for zero-shot detection, classification, and association of motorcycles, passengers, and helmets. In 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall) (pp. 1-7). IEEE

work page 2024

[54] [54]

Greer, R., & Trivedi, M. (2024). Towards explainable, safe autonomous driving with language embeddings for novelty identification and active learning: Framework and experimental analysis with real-world data sets. arXiv preprint arXiv:2402.07320

work page arXiv 2024

[55] [55]

and Kim, D

Park, S., Lee, M., Kang, J., Choi, H., Park, Y., Cho, J., Lee, A. and Kim, D. (2024). Vlaad: Vision and language assistant for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 980-987)

work page 2024

[56] [56]

and Yamamoto, I

Arai, H., Miwa, K., Sasaki, K., Watanabe, K., Yamaguchi, Y., Aoki, S. and Yamamoto, I. (2025). Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 1933-1943). IEEE

work page 2025

[57] [57]

and Houben, S

Fingscheidt, T., Gottschalk, H. and Houben, S. (2022). Deep neural networks and data for automated driving: Robustness, uncertainty quantification, and insights towards safety (p. 427). Springer Nature

work page 2022

[58] [58]

and Greer, R

Dubnov, S. and Greer, R. (2023). Deep and shallow: Machine learning in music and audio. Chapman and Hall/CRC

work page 2023

[59] [59]

Ittichaichareon, C., Suksri, S., & Yingthawornsuk, T. (2012). Speech recognition using MFCC. In International conference on computer graphics, simulation and modeling (Vol. 9, p. 2012)

work page 2012

[60] [60]

and Van Leeuwen, D.A

Vaessen, N. and Van Leeuwen, D.A. (2022). Fine-tuning wav2vec2 for speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7967-7971). IEEE

work page 2022

[61] [61]

and Strik, H

Xue, W., Cucchiarini, C., van Hout, R.W.N.M. and Strik, H. (2019). Acoustic correlates of speech intelligibility. the usability of the egemaps feature set for atypical speech

work page 2019

[62] [62]

Martinez-Sanchez, A., Roy, P., Greer, R. (2026). Natural Language Instructions for Scene-Responsive Human-In-The-Loop Motion Planning in Autonomous Driving Using Vision-Language-Action Models. arXiv preprint

work page 2026

[63] [63]

and Greer, R

Bossen, T.E., Møgelmose, A. and Greer, R. (2025). Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 4779-4788). Greer 20

work page 2025