
arxiv: 2602.07668 · v2 · submitted 2026-02-07 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Pith reviewed 2026-05-16 05:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords multimodal fusion · driver state assessment · audio sensing · vehicle safety · L-LIO framework · intelligent vehicles · sensor fusion · impairment detection

The pith

Adding audio signals to visual sensing forms the L-LIO framework that improves driver state assessment and vehicle environment understanding through multimodal fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the existing LILO visual framework by incorporating audio inputs to create the L-LIO system. This multimodal approach targets better evaluation of driver conditions such as impairment and clearer understanding of surrounding scenes. The authors present three pilot cases using custom real-world audio datasets: classifying driver speech for intoxication or similar states, processing passenger spoken instructions for vehicle planning, and using sound to clarify external agents' gestures when vision alone is ambiguous. They report that audio supplies useful safety information in nuanced situations where visual data falls short. The work notes ongoing challenges with noise, privacy, and consistency across people but positions the fused sensing as a route to stronger safety interventions.

Core claim

The central claim is that expanding the looking-in-looking-out framework with audio signals produces the looking-and-listening inside-and-outside framework, which strengthens driver state assessment and environment understanding via multimodal sensor fusion, as shown in pilot evaluations of speech-based impairment classification, natural-language passenger instructions, and audio disambiguation of external guidance.

What carries the argument

The L-LIO framework, which fuses audio and visual signals collected inside and outside the vehicle to support safety-relevant decisions.
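
One way to picture the fusion step is a late-fusion rule that combines per-modality probabilities for a single safety-relevant event. The sketch below is illustrative only: the paper as summarized here gives no fusion architecture, so the function `late_fusion`, the weight `w_audio`, and the example scores are all assumptions introduced for this example.

```python
from typing import Optional


def late_fusion(p_audio: Optional[float], p_visual: Optional[float],
                w_audio: float = 0.5) -> float:
    """Weighted average of per-modality probabilities for one event.

    Hypothetical helper: the weighting scheme is an assumption, not the
    paper's method. A modality that is unavailable (None) is skipped.
    """
    if not 0.0 <= w_audio <= 1.0:
        raise ValueError("w_audio must be in [0, 1]")
    if p_audio is None and p_visual is None:
        raise ValueError("at least one modality is required")
    if p_audio is None:
        return p_visual
    if p_visual is None:
        return p_audio
    return w_audio * p_audio + (1.0 - w_audio) * p_visual


# Made-up scores: vision is ambiguous (0.5), audio is fairly confident (0.9).
fused = late_fusion(p_audio=0.9, p_visual=0.5, w_audio=0.5)
print(round(fused, 2))
```

Late fusion is only one option; feature-level (early) fusion or a learned gating network would be equally plausible readings of "multimodal sensor fusion".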

If this is right

  • Supervised models trained on driver speech audio can classify states such as intoxication.
  • Passenger spoken instructions can be collected and aligned to guide vehicle planning systems.
  • Audio cues can resolve ambiguities in external agents' gestures and guidance that vision-only systems miss.
  • Multimodal fusion of audio and visual data opens new paths for safety interventions in intelligent vehicles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to passenger monitoring and external pedestrian interaction in shared autonomy settings.
  • Systems built on L-LIO may require separate privacy-preserving audio processing pipelines to reach widespread use.
  • Real-time fusion algorithms would need testing against varying cabin and road noise profiles before deployment.
  • The framework suggests a general pattern for adding sound-based channels to other vision-centric vehicle perception tasks.
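
Testing against varying noise profiles usually starts by mixing clean speech with recorded noise at controlled signal-to-noise ratios. The following is a minimal sketch of that step, assuming same-length sample lists; the function names `rms` and `mix_at_snr`, the synthetic tone, and the Gaussian noise are stand-ins invented here, not material from the paper.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Returns the mixed signal and the gain applied to the noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent")
    # SNR in dB is 20 * log10(rms_speech / rms_noise), so solve for rms_noise.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Synthetic stand-ins: a 200 Hz tone for speech, Gaussian noise for the cabin.
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

for snr_db in (20, 10, 0):
    mixed, gain = mix_at_snr(speech, noise, snr_db)
    achieved = 20 * math.log10(rms(speech) / rms([gain * n for n in noise]))
    print(snr_db, round(achieved, 6))
```

Sweeping the SNR downward and re-running a classifier on each mixture would give a degradation curve, which is the kind of evidence the deployment caveat above calls for.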

Load-bearing premise

Audio signals can supply reliable safety insights in real-world noisy conditions without major interference, privacy conflicts, or loss of performance across different people.

What would settle it

A controlled test showing that audio-based classification of driver impairment or external guidance performs at chance level in typical driving noise levels or across varied speakers.
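
"Performs at chance level" can be made concrete with a one-sided binomial test on the number of correct classifications. The sketch below assumes a balanced two-class task (chance = 0.5); the helper `binomial_p_value` and the counts are hypothetical and are not results from the paper.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided P(X >= correct) for X ~ Binomial(total, chance)."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return sum(
        comb(total, k) * chance ** k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts out of 100 test clips, for illustration only.
p_near_chance = binomial_p_value(correct=52, total=100)
p_above_chance = binomial_p_value(correct=70, total=100)

# A large p-value means the result is consistent with guessing.
print(p_near_chance > 0.05, p_above_chance < 0.05)
```

For imbalanced classes, `chance` would need to reflect the majority-class rate rather than 0.5, and a per-speaker breakdown would address the "across varied speakers" half of the test.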

read the original abstract

The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes extending the looking-in-looking-out (LILO) framework to a looking-and-listening inside-and-outside (L-LIO) framework by adding audio signals for multimodal fusion. It claims this enhances driver state assessment (e.g., speech impairment classification), passenger natural language instructions for planning, and disambiguation of external agents, supported by qualitative pilot findings on custom in-vehicle and external audio datasets collected in real-world environments.

Significance. If the pilot cases were supported by quantitative metrics and baselines, L-LIO could meaningfully advance multimodal safety systems in intelligent vehicles by addressing vision-only limitations in context-rich scenarios. As presented, the conceptual framing is clear but the absence of verifiable results limits its contribution to the literature on driver monitoring and autonomous decision-making.

major comments (2)
  1. [Evaluation of example cases] The three example cases in the evaluation section are described only qualitatively (driver speech impairment, passenger instructions such as 'turn after that red building', and audio disambiguation of external agents) with no reported accuracies, F1 scores, success rates, dataset sizes, collection protocols, model architectures, or comparisons to vision-only baselines. This directly undermines the central claim that audio 'yields safety-relevant insights' and 'enhances' assessment.
  2. [Challenges and future work] The abstract and challenges paragraph mention ambient noise interference, privacy considerations, and robustness across subjects as open issues, yet no experiments, noise-handling methods, or subject-variability tests are provided to assess whether audio can reliably augment safety in dynamic real-world conditions.
minor comments (2)
  1. [Abstract] The abstract states 'pilot findings show that audio yields safety-relevant insights' without enumerating what those specific findings are or referencing any supporting table or figure.
  2. [Framework description] Notation for the proposed L-LIO framework is introduced at a high level but without a diagram or pseudocode clarifying the multimodal fusion architecture relative to the original LILO.
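
The quantitative reporting requested in the first major comment could be as simple as precision, recall, and F1 for a baseline and a candidate system on the same labelled clips. The sketch below shows that computation; the function `precision_recall_f1` and both prediction lists are made up here (standing in for a vision-only baseline and an audio-plus-vision system) and do not come from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels (1 = impaired, 0 = not impaired); not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # stand-in for a vision-only system
candidate_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # stand-in for audio plus vision

_, _, f1_base = precision_recall_f1(y_true, baseline_pred)
_, _, f1_cand = precision_recall_f1(y_true, candidate_pred)
print(round(f1_base, 3), round(f1_cand, 3))
```

Reporting both numbers on an identical test split, along with the dataset size, is what would let a reader judge whether audio adds anything over vision alone.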

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript proposing the L-LIO framework. We address each major comment below and outline the revisions we will make to better scope our claims and clarify the preliminary nature of the work.

read point-by-point responses
  1. Referee: [Evaluation of example cases] The three example cases in the evaluation section are described only qualitatively (driver speech impairment, passenger instructions such as 'turn after that red building', and audio disambiguation of external agents) with no reported accuracies, F1 scores, success rates, dataset sizes, collection protocols, model architectures, or comparisons to vision-only baselines. This directly undermines the central claim that audio 'yields safety-relevant insights' and 'enhances' assessment.

    Authors: We appreciate this observation and agree that the evaluations are qualitative. The manuscript is positioned as a conceptual proposal for extending LILO to L-LIO, using three illustrative pilot cases collected in real-world environments to motivate the framework rather than to deliver a full empirical study with benchmarks. Dataset sizes and collection protocols for the custom in-vehicle and external audio samples are described in the text, but no quantitative metrics or vision-only baselines were computed. We will revise the evaluation section to explicitly frame these as preliminary qualitative examples, tone down claims of enhancement to 'potential' insights, and state that quantitative comparisons are reserved for future work. This will align the presentation with the evidence provided.

    revision: partial

  2. Referee: [Challenges and future work] The abstract and challenges paragraph mention ambient noise interference, privacy considerations, and robustness across subjects as open issues, yet no experiments, noise-handling methods, or subject-variability tests are provided to assess whether audio can reliably augment safety in dynamic real-world conditions.

    Authors: We agree that the challenges are presented without accompanying experiments or methods. As the manuscript focuses on introducing the multimodal framework and motivating its use via pilots, detailed validation of noise robustness or subject variability falls outside the current scope. We will revise the abstract and challenges paragraph to more clearly label these as open issues for future research and briefly outline example directions, such as adaptive filtering for noise or multi-subject data collection protocols, without claiming any current solutions.

    revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual framework proposal without derivations or self-referential reductions

full rationale

The manuscript proposes the L-LIO framework as a multimodal extension of the existing LILO framework by adding audio signals for driver state assessment and scene understanding. No equations, parameters, or quantitative predictions appear anywhere in the text. The three pilot cases (speech impairment classification, passenger instructions, and external agent disambiguation) are described qualitatively with no fitted models, no performance metrics, and no claims that a derived quantity equals an input by construction. LILO is referenced as prior work but is not used to justify any uniqueness theorem or ansatz within this paper; the central claim remains a high-level suggestion for sensor fusion rather than a derived result. The absence of any load-bearing mathematical or predictive step means the derivation chain is empty and the proposal is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on domain assumptions about the utility of audio in vehicle contexts, with no free parameters or formal derivations; the L-LIO framework itself is the main invented entity.

axioms (1)
  • domain assumption: Audio signals provide safety-relevant information not captured by vision alone in driver and scene understanding tasks.
    Invoked throughout the proposal of L-LIO and the three example cases in the abstract.
invented entities (1)
  • L-LIO framework (no independent evidence)
    purpose: To integrate audio sensing with visual data for enhanced multimodal driver state assessment and environment understanding.
    Newly proposed augmentation to the prior LILO framework.

pith-pipeline@v0.9.0 · 5647 in / 1260 out tokens · 68388 ms · 2026-05-16T05:53:54.738441+00:00 · methodology

