Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling

Daosheng Qiu; Hao Su; Haozhuang Chi; Shu Long; Wei Zhang; Xinyue Miao; Yongle Dong

arxiv: 2606.26922 · v1 · pith:KSWWEQNTnew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-26 04:37 UTC · model grok-4.3

keywords: driver monitoring · selective inference · multimodal fusion · risk-aware control · automated vehicles · physiological signals · world modeling

The pith

A cost-aware gate lets a fast RGB-physiological model abstain on uncertain driver states, cutting unsafe false negatives from 17.37% to about 5% at deployment latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a selective inference system for in-cabin driver monitoring that pairs a lightweight multimodal student with a learned gate. The student fuses cabin video with heart-rate and electrodermal signals to classify driver demand. The gate uses per-sample scores to accept the fast output or trigger safety intervention instead of always running a slower large model. Experiments show the combination lowers missed unsafe states while preserving low latency. A separate world-modeling module is added to forecast future errors and action costs, though it reveals remaining calibration problems across driver groups.

Core claim

Cost-aware selective inference with an RGB-physiological student and learned gate reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds while keeping latency at deployment level (the student runs in 3.08 ms); the student itself reaches 0.7440 Macro-F1 on scenario-induced driver-demand recognition.

What carries the argument

The learned gate that decides per-sample whether to accept the fast RGB-physiological prediction or abstain for safety intervention, using scores that contain information beyond scenario priors.
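
To make the accept-or-abstain choice concrete, here is a minimal Python sketch of an expected-cost rule. It is not the paper's gate, which is learned; this is a hand-written stand-in under stated assumptions. The action names `accept_fast` and `abstain_warn` come from the Figure 1 caption, while the `Costs` class, its numeric values, and the `gate_decision` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-decision costs; the values are illustrative only."""
    unsafe_fn: float = 10.0   # accepting "low demand" when demand is high
    false_alarm: float = 1.0  # accepting "high demand" when demand is low
    abstain: float = 0.5      # fixed price of a conservative warning


def gate_decision(p_high: float, costs: Costs = Costs()) -> str:
    """Pick the cheaper action in expectation for a single sample.

    p_high is the student's probability that the driver is in a
    high-demand state (an assumed input, not the paper's gate features).
    """
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("p_high must be a probability")
    predicts_high = p_high >= 0.5
    if predicts_high:
        # Trusting a "high" prediction is wrong with probability 1 - p_high.
        accept_cost = (1.0 - p_high) * costs.false_alarm
    else:
        # Trusting a "low" prediction is wrong with probability p_high.
        accept_cost = p_high * costs.unsafe_fn
    return "accept_fast" if accept_cost <= costs.abstain else "abstain_warn"


for p in (0.02, 0.2, 0.45, 0.9):
    print(p, gate_decision(p))
```

Because the assumed cost of an unsafe false negative is ten times that of a false alarm, the rule abstains on "low demand" predictions unless the student is very confident, which is the asymmetry a cost-aware gate is meant to encode.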

If this is right

The RGB-physiological student improves over single-modality baselines to 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39 M parameters.
Cost-aware selection keeps overall system latency at deployment levels while lowering the unsafe error rate.
Driver-state world modeling supplies predictive signals for future model errors and counterfactual costs.
Worst-group evaluations still show operating-point calibration drift even with the added predictive module.
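
The two headline metrics in the list above can differ substantially on imbalanced data. The following Python sketch uses the standard definitions (Macro-F1 as the unweighted mean of per-class F1, balanced accuracy as the unweighted mean of per-class recall); the toy labels are invented for illustration and are not the paper's data.

```python
def per_class_counts(labels, preds, cls):
    """Return (true positives, false positives, false negatives) for cls."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    return tp, fp, fn


def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 over the classes present in labels."""
    scores = []
    for cls in sorted(set(labels)):
        tp, fp, fn = per_class_counts(labels, preds, cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(labels, preds):
    """Unweighted mean of per-class recall over the classes in labels."""
    recalls = []
    for cls in sorted(set(labels)):
        tp, _, fn = per_class_counts(labels, preds, cls)
        recalls.append(tp / (tp + fn))  # tp + fn > 0 since cls is in labels
    return sum(recalls) / len(recalls)


# Toy data: 8 low-demand (0) and 2 high-demand (1) samples.
labels = [0] * 8 + [1] * 2
preds = [0] * 6 + [1] * 2 + [1] * 2
print(round(macro_f1(labels, preds), 3))   # 0.762
print(balanced_accuracy(labels, preds))    # 0.875
```

In this toy case every high-demand sample is recalled, which lifts balanced accuracy, while the two false alarms lower the minority-class precision and pull Macro-F1 down.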

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gate-plus-student pattern could be tested in other latency-critical safety settings such as pedestrian detection or medical alarm systems.
Improving physiological signal alignment across sensors would directly raise the upper bound on student accuracy.
Group-robust calibration techniques would need to be added before the system can be deployed across varied driver populations.
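
One way to check the group-calibration point is to compute a calibration error per driver group and report the worst one. The sketch below uses expected calibration error (ECE) with equal-width bins, a common measure chosen here as an assumption; the source does not say how the paper quantifies operating-point calibration drift. The function names and the record layout are hypothetical.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(mean_conf - accuracy)
    return ece


def worst_group_ece(records, n_bins=10):
    """records: iterable of (group, confidence, prediction_was_correct)."""
    by_group = defaultdict(lambda: ([], []))
    for group, conf, ok in records:
        by_group[group][0].append(conf)
        by_group[group][1].append(ok)
    scores = {
        group: expected_calibration_error(confs, oks, n_bins)
        for group, (confs, oks) in by_group.items()
    }
    worst = max(scores, key=scores.get)
    return worst, scores


# Invented example: both groups report 0.9 confidence, but group "b"
# is right only half of the time.
records = [("a", 0.9, i < 9) for i in range(10)]
records += [("b", 0.9, i < 5) for i in range(10)]
worst, scores = worst_group_ece(records)
print(worst, {g: round(s, 3) for g, s in scores.items()})
```

A pooled ECE over both groups would average the two and hide the fact that one group is well calibrated while the other is badly overconfident, which is why the worst-group figure is reported separately.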

Load-bearing premise

The gate can reliably read sample-level signals to choose abstention without creating new safety risks, and the physiological signals stay synchronized enough for the student model to work.

What would settle it

A controlled test on the same driver-demand scenarios in which turning on the learned gate either raises the overall unsafe false-negative rate above the always-fast baseline or produces new false positives that trigger unnecessary interventions at higher total cost.
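
The test described above can be sketched as a small comparison between the always-fast baseline and the gated system. Everything here is an assumption made for illustration: the unsafe false-negative rate is taken as the share of high-demand samples that are accepted with a "low" prediction (the paper's exact denominator is not stated in the source), and the cost values and helper names are hypothetical.

```python
def unsafe_fn_rate(labels, preds, accepted):
    """Share of high-demand samples (label 1) accepted as low demand (pred 0)."""
    high = [i for i, y in enumerate(labels) if y == 1]
    if not high:
        raise ValueError("need at least one high-demand sample")
    missed = sum(1 for i in high if accepted[i] and preds[i] == 0)
    return missed / len(high)


def total_cost(labels, preds, accepted, c_fn=10.0, c_fp=1.0, c_abstain=0.5):
    """Sum of hypothetical costs: abstentions, unsafe misses, false alarms."""
    cost = 0.0
    for y, p, a in zip(labels, preds, accepted):
        if not a:
            cost += c_abstain
        elif y == 1 and p == 0:
            cost += c_fn
        elif y == 0 and p == 1:
            cost += c_fp
    return cost


def gate_is_refuted(labels, preds, gate_accepts):
    """True if the gate is less safe or more costly than always-fast."""
    always = [True] * len(labels)
    worse_safety = (unsafe_fn_rate(labels, preds, gate_accepts)
                    > unsafe_fn_rate(labels, preds, always))
    worse_cost = (total_cost(labels, preds, gate_accepts)
                  > total_cost(labels, preds, always))
    return worse_safety or worse_cost


labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 0, 0, 1, 0, 0, 1, 0]
good_gate = [True, False, True, True, True, True, False, True]
bad_gate = [False, True, True, False, False, False, True, False]
print(gate_is_refuted(labels, preds, good_gate))  # False
print(gate_is_refuted(labels, preds, bad_gate))   # True
```

Under this definition a gate that only abstains cannot raise the unsafe rate, because it can only remove accepted samples; the cost comparison is the condition that can actually fail, as the second toy gate shows by abstaining on correct predictions while accepting the errors.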

Figures

Figures reproduced from arXiv: 2606.26922 by Daosheng Qiu, Hao Su, Haozhuang Chi, Shu Long, Wei Zhang, Xinyue Miao, Yongle Dong.

**Figure 1.** Overview of the proposed framework. The system continuously processes RGB frames and window-level HR/EDA signals through a lightweight fast student. Instead of forcing a mandatory classification, a learned cost-aware gate evaluates instantaneous reliability and predictive evidence from a compact driver-state world modeling module. The gate then decides whether to accept_fast, abstain_warn, slow_replace, or…

**Figure 2.** Selective confusion matrix. Comparison between always-fast inference and the learned cost-aware gate. By explicitly optimizing for asymmetric risk, the learned gate successfully redistributes safety-critical errors (Unsafe FNs, red) into conservative or positive abstentions (orange), while simultaneously increasing the number of correctly accepted high-demand states.

**Figure 3.** Qualitative comparison of modality contributions. Case A shows visual cues are sufficient. Case B demonstrates the necessity of multimodal fusion: when visual features are ambiguous (head down), physiological dynamics (EDA/HR drop) successfully recover the true High demand state. Case C shows a failure case where both modalities fail to capture the state change.

**Figure 5.** Calibration, deployment frontier, and matched-coverage safety be…

Original abstract

Continuous driver monitoring in automated vehicles requires low-latency inference while avoiding unsafe decisions under uncertain driver states. Large vision-language models provide broad multimodal priors, but their latency and limited reliability in this setting make them unsuitable as always-on in-cabin monitors. We propose a cost-aware selective inference framework for deployable multimodal driver monitoring. The core system is a lightweight RGB-physiological student that combines in-cabin visual observations with window-level HR/EDA signals, and a learned gate that decides when to accept the fast prediction or abstain for safety intervention. Additional controls show that the learned scores contain sample-level information beyond scenario priors, while exact physiological synchronization remains a limitation. To incorporate predictive evidence, we further study a compact driver-state world modeling module that rolls out latent driver-state features and estimates future fast-model errors and counterfactual system-level action costs. On scenario-induced driver-demand recognition, the RGB-physiological student improves over RGB-only and physiology-only baselines, reaching 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39M parameters and 3.08ms inference latency. Cost-aware selective inference reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds, while maintaining deployment-level latency. While driver-state world modeling offers valuable predictive signals, worst-group evaluations highlight persistent operating-point calibration drift. Ultimately, reliable edge driver monitoring requires advancing not only perception backbones, but also risk-aware selective control and group-robust calibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The selective gate cuts unsafe false negatives from 17% to 5% in the reported setup, but the paper's own note on physiological synchronization undercuts how far those gains will carry outside the lab.

The main thing to know is that this paper puts together a fast RGB-physiological student model, a learned cost-aware gate for abstention, and a compact driver-state world model to predict future errors. On their scenario-induced task the student reaches 0.7440 Macro-F1 at 3 ms latency, and the selective system drops unsafe false negatives from 17.37% to roughly 5% while staying at deployment latency. They also run controls showing the gate uses sample-level signals beyond scenario priors.

That combination is the concrete advance. The numbers are specific, the latency target is realistic for edge use, and they are straightforward about two practical problems: calibration drift on worst groups and the fact that exact physiological synchronization remains a limitation.

The synchronization point is the real soft spot. The reported fusion performance and the gate's safety benefit were measured under conditions the authors themselves flag as idealized. Any real offset between RGB frames and HR/EDA windows will degrade the student outputs and give the gate noisier inputs, so the drop from 17% to 5% unsafe errors is unlikely to hold in the car. The world-modeling module is presented as additional predictive evidence but does not appear to fix the calibration or sync issues.
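
A simple way to probe the synchronization concern is to shift the physiological window relative to the RGB timestamp and watch how a window-level feature moves. The Python sketch below does this for a window mean. The helper names, the sampling rate, and the synthetic ramp signal are all assumptions; the source does not describe the paper's preprocessing.

```python
def window_mean(signal, rate_hz, t_end_s, window_s, offset_s=0.0):
    """Mean of the samples in [t_end - window + offset, t_end + offset)."""
    start = round((t_end_s - window_s + offset_s) * rate_hz)
    stop = round((t_end_s + offset_s) * rate_hz)
    if start < 0 or stop > len(signal) or start >= stop:
        raise ValueError("window falls outside the recorded signal")
    window = signal[start:stop]
    return sum(window) / len(window)


def offset_sensitivity(signal, rate_hz, t_end_s, window_s, offsets_s):
    """Change in the window mean, relative to zero offset, per clock offset."""
    reference = window_mean(signal, rate_hz, t_end_s, window_s)
    return {
        off: window_mean(signal, rate_hz, t_end_s, window_s, off) - reference
        for off in offsets_s
    }


# Synthetic EDA-like ramp sampled at 10 Hz for 10 s (invented data).
eda = [float(i) for i in range(100)]
print(window_mean(eda, 10, 5.0, 2.0))                       # 39.5
print(offset_sensitivity(eda, 10, 5.0, 2.0, [-1.0, 0.0, 1.0]))
```

On a signal that is changing quickly, a one-second clock offset moves the feature by a large amount; on a flat signal it would not. A stress test of this kind would feed the shifted features to the student and gate and re-measure the unsafe error rate.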

This is aimed at people working on multimodal perception and risk-aware control for vehicles. It has enough empirical grounding and self-critique to deserve a serious referee who can check the data splits, the gate training, and whether the synchronization assumption was stress-tested.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a cost-aware selective inference framework for low-latency multimodal driver monitoring in automated vehicles. A lightweight RGB-physiological student model fuses in-cabin RGB observations with window-level HR/EDA signals to reach 0.7440 Macro-F1 and 0.9099 balanced accuracy (11.39M parameters, 3.08ms latency). A learned gate decides between accepting the fast prediction or abstaining for safety intervention, reducing unsafe false negatives from 17.37% (always-fast inference) to approximately 5% across seeds while preserving deployment latency. A compact driver-state world modeling module is studied for rolling out latent features and estimating future errors and counterfactual costs. The work explicitly notes that exact physiological synchronization remains a limitation and that worst-group evaluations show persistent calibration drift.

Significance. If the reported safety gains hold under realistic synchronization conditions, the selective-inference approach could meaningfully improve the reliability of edge-deployed driver monitoring by trading off latency against risk without always invoking heavy models. The explicit discussion of synchronization as a limitation and the inclusion of world-modeling for predictive cost estimation are constructive contributions. The controls demonstrating sample-level information in the gate scores beyond scenario priors strengthen the case for learned abstention.

major comments (2)

[Abstract] The central safety claim, that cost-aware selective inference reduces unsafe false negatives from 17.37% to ~5%, rests on the RGB-physiological student achieving 0.7440 Macro-F1. The same paragraph states that "exact physiological synchronization remains a limitation," which directly undermines confidence that the reported fusion performance (and therefore the gate's effectiveness) would be realized in deployment where HR/EDA signals may exhibit temporal offsets relative to RGB frames.
[Abstract] The manuscript reports that "additional controls show that the learned scores contain sample-level information beyond scenario priors," yet provides no quantitative details on the control experiments, ablation results, or statistical tests supporting this claim. This information is load-bearing for validating that the gate is not merely learning scenario-level priors.
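
One control that would speak to the second comment is to measure, inside each scenario separately, how well the gate score ranks the fast model's errors. If the score carried only scenario-level priors, the within-scenario AUROC would sit near 0.5. The Python sketch below is a hypothetical version of such a check, not the control the paper ran; it assumes a higher score means a higher predicted risk of error, and the record layout is invented.

```python
from collections import defaultdict


def auroc(scores, positives):
    """Pairwise AUROC with ties counted as half; None if one class is absent."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def within_scenario_auroc(records):
    """records: iterable of (scenario, gate_score, fast_model_erred)."""
    grouped = defaultdict(lambda: ([], []))
    for scenario, score, erred in records:
        grouped[scenario][0].append(score)
        grouped[scenario][1].append(erred)
    return {s: auroc(scores, errs) for s, (scores, errs) in grouped.items()}


records = [
    # Scenario A: the score separates errors from correct predictions.
    ("A", 0.9, True), ("A", 0.8, True), ("A", 0.2, False), ("A", 0.1, False),
    # Scenario B: a constant score, i.e. nothing beyond a scenario prior.
    ("B", 0.5, True), ("B", 0.5, False), ("B", 0.5, True), ("B", 0.5, False),
]
print(within_scenario_auroc(records))  # {'A': 1.0, 'B': 0.5}
```

Reporting these per-scenario values, with a permutation test or confidence interval, would be one form of the quantitative detail the comment asks for.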

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the selective-inference approach. We address each major comment below.

Referee: [Abstract] The central safety claim, that cost-aware selective inference reduces unsafe false negatives from 17.37% to ~5%, rests on the RGB-physiological student achieving 0.7440 Macro-F1. The same paragraph states that "exact physiological synchronization remains a limitation," which directly undermines confidence that the reported fusion performance (and therefore the gate's effectiveness) would be realized in deployment where HR/EDA signals may exhibit temporal offsets relative to RGB frames.

Authors: The reported performance and safety gains are measured under the experimental condition of window-level aligned physiological signals with RGB frames. The explicit statement that exact synchronization remains a limitation accurately flags that temporal offsets in real deployments could reduce fusion effectiveness and thereby weaken the gate. We will revise the abstract to qualify the results as holding for synchronized inputs and to note the implications for deployment.

Revision: yes

Referee: [Abstract] The manuscript reports that "additional controls show that the learned scores contain sample-level information beyond scenario priors," yet provides no quantitative details on the control experiments, ablation results, or statistical tests supporting this claim. This information is load-bearing for validating that the gate is not merely learning scenario-level priors.

Authors: Quantitative results from the control experiments (ablation of scenario conditioning and statistical tests) appear in Section 4.3. To make the supporting evidence immediately visible in the summary, we will insert a concise statement of the key quantitative findings into the abstract.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML results with no self-referential derivations

The paper presents an empirical framework for cost-aware selective multimodal driver monitoring, reporting experimental metrics such as 0.7440 Macro-F1 for the RGB-physiological student, reduction of unsafe false negatives from 17.37% to ~5%, and evaluations of a driver-state world modeling module. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on benchmark comparisons and ablation studies rather than renaming known results or smuggling ansatzes via prior self-work. This matches the provided reader's assessment that no abstract-level derivations reduce to inputs by construction, confirming the derivation chain is self-contained and externally falsifiable via reported performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the gate and world model are described at high level without detailing their internal assumptions or fitted values.

[1] [1]

In: Proceedings of the 36th International Conference on Neural Information Processing Systems

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millicah, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a...

2022

[2] [2]

In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=33XGfHLtZg

Angelopoulos, A.N., Bates, S., Fisch, A., Lei, L., Schuster, T.: Conformal risk control. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=33XGfHLtZg

2024

[3] [3]

arXiv e-prints arXiv:2404.08471 (Feb 2024).https://doi.org/10.48550/ arXiv.2404.08471

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv e-prints arXiv:2404.08471 (Feb 2024).https://doi.org/10.48550/ arXiv.2404.08471

Pith/arXiv arXiv 2024

[4] [4]

Chi, H., Qiu, D., Su, H., Liu, H., Li, Z., Zhang, H., Lv, C.: Driver-wm: A driver- centric traffic-conditioned latent world model for in-cabin dynamics rollout (2026), https://arxiv.org/abs/2605.05092

Pith/arXiv arXiv 2026

[5] [5]

Dhaouadi, J

Chi, H., Yang, H., Yang, L., Lv, C.: Vlm-dm: Visual language models for multitask domain adaptation in driver monitoring. In: 2025 IEEE Intelligent Vehicles Sym- posium (IV). pp. 1280–1285 (2025).https://doi.org/10.1109/IV64158.2025. 11097620

work page doi:10.1109/iv64158.2025 2025

[6] [6]

IEEE Transactions on Information Theory16(1), 41–46 (1970).https://doi.org/10.1109/TIT.1970

Chow, C.: On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory16(1), 41–46 (1970).https://doi.org/10.1109/TIT.1970. 1054406

work page doi:10.1109/tit.1970 1970

[7] [7]

In: Proceedings of the 37th International Conference on Neural Information Processing Systems

Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: towards general-purpose vision-language models with instruction tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

2023

[8] [8]

Scientific Data11(1), 327 (Mar 2024).https://doi.org/10.1038/s41597-024-03137-y , https://doi

Dargahi Nobari, K., Bertram, T.: A multimodal driver monitoring benchmark dataset for driver modeling in assisted driving automation. Scientific Data11(1), 327 (Mar 2024).https://doi.org/10.1038/s41597-024-03137-y , https://doi. org/10.1038/s41597-024-03137-y

work page doi:10.1038/s41597-024-03137-y 2024

[9] [9]

In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R

Geifman, Y., El-Yaniv, R.: Selective classification for deep neural networks. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017),https://proceedings.neurips.cc/paper_files/ paper/2017/file/4a8423d5e91fda00bb7e4654...

2017

[10]

Geifman, Y., El-Yaniv, R.: SelectiveNet: A deep neural network with an integrated reject option. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2151–2159. PMLR (09–15 Jun 2019), https://proceedings.mlr.press/v97/geifman19a.html

[11]

Ha, D., Schmidhuber, J.: World Models. arXiv e-prints arXiv:1803.10122 (Mar 2018). https://doi.org/10.48550/arXiv.1803.10122

[12]

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.: Learning latent dynamics for planning from pixels. In: International Conference on Machine Learning. pp. 2555–2565 (2019)

[13]

Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering Diverse Domains through World Models. arXiv e-prints arXiv:2301.04104 (Jan 2023). https://doi.org/10.48550/arXiv.2301.04104

[14]

Hansen, N., Su, H., Wang, X.: Td-mpc2: Scalable, robust world models for continuous control. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on Learning Representations. vol. 2024, pp. 47376–47405 (2024), https://proceedings.iclr.cc/paper_files/paper/2024/file/cf73d57b6dcda32b293df7c2d5341f49-Paper-Conference.pdf

[15]

Healey, J., Picard, R.: Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems 6(2), 156–166 (2005). https://doi.org/10.1109/TITS.2005.848368

[16]

Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. arXiv e-prints arXiv:1503.02531 (Mar 2015). https://doi.org/10.48550/arXiv.1503.02531

[17]

Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving (2023), https://arxiv.org/abs/2309.17080

[18]

Huang, J., Huang, X., Peng, Y., Hu, L.: Driver state recognition with physiological signals: Based on deep feature fusion and feature selection techniques. Biomedical Signal Processing and Control 93, 106204 (2024). https://doi.org/10.1016/j.bspc.2024.106204, https://www.sciencedirect.com/science/article/pii/S1746809424002623

[19]

Jang, J., Ma, C., Lee, B.: Vl2lite: Task-specific knowledge distillation from large vision-language models to lightweight networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 30073–30083 (June 2025)

[20]

Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R.L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B.A., Haque, I.S., Beery, S., Leskovec, J., Kundaje, A.B., Pierson, E., Levine, S., Finn, C., Liang, P.: Wilds: A benchmark of in-the-wild distribution shifts. In: International Conference on Machine Learning (2020), https://api.semanticscholar.org/CorpusID:229156320

[21]

Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp....

[22]

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 34892–34916. Curran Associates, Inc. (2023), https://proceedings.neurips.cc/paper_files/paper/20...

[23]

Liu, Z., Wang, Z., Liang, P.P., Salakhutdinov, R.R., Morency, L.P., Ueda, M.: Deep gamblers: Learning to abstain with portfolio theory. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 32. Curran Associates, Inc. (2019), https://proceedings.neurips.cc/p...

[24]

Madras, D., Pitassi, T., Zemel, R.: Predict responsibly: Improving fairness and accuracy by learning to defer. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc. (2018), https://proceedings.neurips.cc/paper_files/paper/2018/file/...

[25]

Martin, M., Roitberg, A., Haurilet, M., Horne, M., Reiß, S., Voit, M., Stiefelhagen, R.: Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2801–2810 (2019). https://doi.org/10.1109/ICCV.2019.00289

[26]

Meteier, Q., Capallera, M., Ruffieux, S., Angelini, L., Abou Khaled, O., Mugellini, E., Widmer, M., Sonderegger, A.: Classification of drivers’ workload using physiological signals in conditional automation. Frontiers in Psychology 12 (2021). https://doi.org/10.3389/fpsyg.2021.596038, https://www.frontiersin.org/journals/psychology/articl...

[27]

Mozannar, H., Sontag, D.: Consistent estimators for learning to defer to an expert. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 7076–7087. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/mozannar20b.html

[28]

Ortega, J.D., Kose, N., Cañas, P., Chao, M.A., Unnervik, A., Nieto, M., Otaegui, O., Salgado, L.: Dmd: A large-scale multi-modal driver monitoring dataset for attention and alertness analysis. In: Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part IV. pp. 387–405. Springer-Verlag, Berlin, Heidelberg (2020). https://do...

[29]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Resea...

[30]

Sagawa*, S., Koh*, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=ryxGuJrFvS

[31]

Sahayadhas, A., Sundaraj, K., Murugappan, M.: Detecting driver drowsiness based on sensors: A review. Sensors 12(12), 16937–16953 (2012). https://doi.org/10.3390/s121216937, https://www.mdpi.com/1424-8220/12/12/16937

[32]

Sriranga, A.K., Lu, Q., Birrell, S.: A systematic review of in-vehicle physiological indices and sensor technology for driver mental workload monitoring. Sensors 23(4) (2023). https://doi.org/10.3390/s23042214, https://www.mdpi.com/1424-8220/23/4/2214

[33]

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv e-prints arXiv:2409.12191 (Sep 2024). https://doi.org/10.48550/arXiv.2409.12191

[34]

Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: Drivedreamer: Towards real-world-drive world models for autonomous driving. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 55–72. Springer Nature Switzerland, Cham (2025)

A Additional Discussion

This section clarifies several design choices that are central to interpreting the proposed framework.

Q1: Why formulate driver monitoring as selective inference rather than simply maximizing classification accuracy?

Continuous d...