Real-Time Driver State Monitoring Using a CNN Based Spatio-Temporal Approach

Alexander Unnervik; Gerhard Rigoll; Neslihan Kose; Okan Kopuklu

arxiv: 1907.08009 · v1 · pith:3FMLIVBDnew · submitted 2019-07-18 · 💻 cs.CV · cs.LG· eess.IV

Real-Time Driver State Monitoring Using a CNN Based Spatio-Temporal Approach

Neslihan Kose , Okan Kopuklu , Alexander Unnervik , Gerhard Rigoll This is my paper

Pith reviewed 2026-05-24 19:52 UTC · model grok-4.3

classification 💻 cs.CV cs.LGeess.IV

keywords driver distraction detectionspatio-temporal CNNaction recognitionBN-Inceptionoptical flow fusionreal-time classificationDistracted Driver DatasetBrain4Cars

0 comments

The pith

A CNN pipeline using sparse video frames and RGB-optical flow fusion reaches 99.10 percent accuracy classifying ten driver distraction states in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats driver distraction detection as an action recognition problem and shows that spatial features taken from a small number of frames by a pre-trained BN-Inception network, when fused at the data level with optical flow, are enough to exceed earlier accuracy records. The method records 99.10 percent on the ten-class Distracted Driver Dataset while running in real time and gains still more accuracy when the two modalities are combined. The same pipeline also improves results on the Brain4Cars dataset. Because road accidents frequently trace to distraction, even in vehicles with autonomous features, a lightweight monitoring system that does not require full retraining or dense frame sampling would be directly usable for alerts.

Core claim

Our spatio-temporal approach relies on features extracted from sparsely selected frames of an action using a pre-trained BN-Inception network. Experiments show that our approach outperforms the state-of-the-art results on the Distracted Driver Dataset (96.31 percent), with an accuracy of 99.10 percent for ten-class classification while providing real-time performance. We also analyzed the impact of fusion using RGB and optical flow modalities with a very recent data-level fusion strategy. The results on the Distracted Driver and Brain4Cars datasets show that fusion of these modalities further increases the accuracy.

What carries the argument

Spatio-temporal feature extraction from sparsely selected frames by a pre-trained BN-Inception network followed by data-level fusion of RGB and optical flow.

If this is right

The reported 99.10 percent accuracy and real-time speed make the pipeline immediately deployable for in-vehicle driver alerts.
Data-level fusion of RGB and optical flow produces measurable gains on both the Distracted Driver and Brain4Cars datasets.
The approach works without end-to-end fine-tuning or dense frame sampling on the target datasets.
The same sparse-frame strategy can be applied to other driver-state or action-recognition tasks that require low latency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lightweight pipeline could be tested on other safety-critical action recognition problems such as factory floor monitoring or elderly fall detection.
Adding a simple temporal model on top of the fused features might raise accuracy further without losing real-time capability.
Evaluating the method on driving footage recorded under rain, night, or unusual camera angles would test whether the pre-trained features generalize beyond the current datasets.

Load-bearing premise

Features taken from only a few frames by a network trained on other data, plus simple modality fusion, already contain all the spatial and temporal cues needed for high-accuracy distraction classification.

What would settle it

Apply the identical pipeline to the Distracted Driver Dataset and obtain either an accuracy below 96.31 percent or a processing speed that cannot sustain real-time operation.

Figures

Figures reproduced from arXiv: 1907.08009 by Alexander Unnervik, Gerhard Rigoll, Neslihan Kose, Okan Kopuklu.

**Figure 2.** Figure 2: Data level fusion of optical flow and color modalities showing an [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Examples from 10 classes of driver postures in the Distracted Driver Dataset. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Confusion matrix using our approach. TABLE III DISTRACTED OR NOT? Classes True Prediction False Prediction Safe Driving 887 35 Distracted Driving 3403 6 belong to distracted driving class. Using the predefined traintest set split, we also analyzed the problem as distracted driver detection. Table III shows the number of true and false predictions evaluated after training our approach including these two c… view at source ↗

read the original abstract

Many road accidents occur due to distracted drivers. Today, driver monitoring is essential even for the latest autonomous vehicles to alert distracted drivers in order to take over control of the vehicle in case of emergency. In this paper, a spatio-temporal approach is applied to classify drivers' distraction level and movement decisions using convolutional neural networks (CNNs). We approach this problem as action recognition to benefit from temporal information in addition to spatial information. Our approach relies on features extracted from sparsely selected frames of an action using a pre-trained BN-Inception network. Experiments show that our approach outperforms the state-of-the art results on the Distracted Driver Dataset (96.31%), with an accuracy of 99.10% for 10-class classification while providing real-time performance. We also analyzed the impact of fusion using RGB and optical flow modalities with a very recent data level fusion strategy. The results on the Distracted Driver and Brain4Cars datasets show that fusion of these modalities further increases the accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They hit 99.10% on the distracted driver set with pre-trained BN-Inception sparse frames plus RGB-flow fusion, but the result sits on unadapted features and thin evaluation details.

read the letter

The paper reports 99.10% accuracy on 10-class driver distraction classification using features from a pre-trained BN-Inception network on sparsely sampled frames, combined with a data-level RGB and optical flow fusion. This beats the cited prior result of 96.31% on the Distracted Driver Dataset and they also show gains on Brain4Cars while claiming real-time speed. The main addition is the application of that recent fusion strategy to this domain with the sparse sampling choice to keep compute low. It demonstrates that off-the-shelf spatial features plus simple modality fusion can improve numbers without end-to-end training or dense frames. The soft spot is exactly the assumption that ImageNet-pretrained features on sparse frames are already good enough. Driver videos differ in lighting, pose, and occlusion, and distraction actions often need finer motion cues than sparse sampling supplies. Without backbone adaptation or ablations on frame count and fusion parameters, the reported margin could shrink. The abstract supplies no error bars, dataset splits, or statistical tests, so the central claim is difficult to assess from the given information. This work is for applied researchers building in-car monitoring systems who need an empirical baseline on public datasets. Readers looking for architectural novelty or rigorous temporal modeling will not find it. Send it to peer review so the full methods and controls can be checked.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a spatio-temporal CNN approach for real-time driver distraction classification, treating the task as action recognition. Features are extracted from sparsely sampled frames using a pre-trained BN-Inception network, with optional data-level fusion of RGB and optical flow modalities. The central empirical claim is 99.10% accuracy on 10-class classification on the Distracted Driver Dataset (outperforming prior SOTA of 96.31%), with real-time performance and further gains from fusion also shown on the Brain4Cars dataset.

Significance. If the results hold under rigorous validation, the work would demonstrate that frozen ImageNet-pretrained features with sparse temporal sampling and simple modality fusion can achieve high accuracy in driver monitoring without end-to-end fine-tuning, offering a practical route to real-time systems.

major comments (2)

[Abstract] Abstract: the 99.10% accuracy claim (and margin over 96.31% SOTA) is stated without error bars, train/test split protocol, number of runs, or statistical tests, preventing assessment of whether the result is reliable or reproducible.
[Approach] Approach (feature extraction): the decision to rely exclusively on a frozen BN-Inception backbone with sparse frame selection is load-bearing for the spatio-temporal claim; no ablation on fine-tuning, domain adaptation, or denser sampling is referenced, leaving open whether ImageNet features suffice for in-car distraction actions.

minor comments (1)

[Abstract] Abstract: the reference to a 'very recent data level fusion strategy' is not accompanied by a citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation.

read point-by-point responses

Referee: [Abstract] Abstract: the 99.10% accuracy claim (and margin over 96.31% SOTA) is stated without error bars, train/test split protocol, number of runs, or statistical tests, preventing assessment of whether the result is reliable or reproducible.

Authors: The train/test split follows the standard protocol defined by the Distracted Driver Dataset authors and is detailed in Section 4.1 of the manuscript. The reported accuracy is from the single evaluation run used throughout the experiments. We agree the abstract would benefit from a brief reference to the split protocol and will revise it accordingly. Error bars and statistical tests were not computed, as the work emphasizes real-time feasibility over variance analysis; this detail will be noted explicitly in the revised abstract and experimental section. revision: partial
Referee: [Approach] Approach (feature extraction): the decision to rely exclusively on a frozen BN-Inception backbone with sparse frame selection is load-bearing for the spatio-temporal claim; no ablation on fine-tuning, domain adaptation, or denser sampling is referenced, leaving open whether ImageNet features suffice for in-car distraction actions.

Authors: The frozen BN-Inception with sparse sampling was chosen to highlight an efficient, practical pipeline that achieves real-time performance without end-to-end training. While the manuscript does not contain explicit ablations on fine-tuning or denser sampling, the results demonstrate that these pre-trained features, combined with the proposed temporal modeling, exceed prior state-of-the-art. We will add a short discussion in the approach section explaining the design rationale and will reference related work on domain adaptation to address this point. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical reporting of CNN classification accuracies on public datasets.

full rationale

The manuscript contains no equations, derivations, or mathematical claims. It describes an empirical pipeline that extracts features from a publicly available pre-trained BN-Inception network on sparsely sampled frames, applies a standard data-level fusion strategy, and reports classification accuracies on the Distracted Driver Dataset and Brain4Cars. All performance numbers are externally falsifiable by re-running the same pipeline on the same public data; nothing reduces to a self-definition, fitted input renamed as prediction, or self-citation load-bearing step. The central claim (99.10 % accuracy) is therefore a measurement, not a derivation that collapses into its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Ledger entries are inferred solely from the abstract because the full manuscript was not available for inspection.

free parameters (2)

number and selection of sparse frames
The abstract states that features come from sparsely selected frames; the exact count and selection rule are hyperparameters that affect the temporal modeling.
fusion weights or strategy parameters
A very recent data-level fusion strategy is used; its internal parameters are not specified in the abstract.

axioms (1)

domain assumption Features from a pre-trained BN-Inception network transfer effectively to the driver-distraction classification task without domain-specific fine-tuning.
The abstract relies on this transfer without stating validation or adaptation steps.

pith-pipeline@v0.9.0 · 5716 in / 1220 out tokens · 21829 ms · 2026-05-24T19:52:42.084173+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 2 internal anchors

[1]

National highway trafﬁc safety administration trafﬁc safety facts,

“National highway trafﬁc safety administration trafﬁc safety facts,” https://www.nhtsa.gov/risky-driving/distracted-driving/

work page
[2]

Seeing machines,

“Seeing machines,” http://www.seeingmachines.com

work page
[3]

Valeo driver monitoring,

“Valeo driver monitoring,” https://www.valeo.com/en/ driver-monitoring/

work page
[4]

Eyesight driver sense & cabin sense,

“Eyesight driver sense & cabin sense,” http://www.eyesight-tech.com/ automotive/

work page
[5]

Sensedrive driver monitor system,

“Sensedrive driver monitor system,” https://www.sensetime.com/other/ 597

work page
[6]

Detection of distracted driver using convolutional neural network,

B. Baheti, S. Gajre, and S. Talbar, “Detection of distracted driver using convolutional neural network,” The IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops , June 2018

work page 2018
[7]

Real-time Distracted Driver Posture Classification

Y . Abouelnaga, H. M. Eraqi, and M. N. Moustafa, “Real-time distracted driver posture classiﬁcation,” CoRR, vol. abs/1706.09498, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Motion fused frames: Data level fusion strategy for hand gesture recognition,

O. K ¨op¨ukl¨u, N. K¨ose, and G. Rigoll, “Motion fused frames: Data level fusion strategy for hand gesture recognition,” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops , pp. 2216–2224, 2018

work page 2018
[9]

Car that knows before you do: Anticipating maneuvers via learning temporal driving models,

A. Jain, H. S. Koppula, et al. , “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” IEEE Int. Conf. on Computer Vision (ICCV) , pp. 3182–3190, 2015

work page 2015
[10]

Development of facial-direction detection sensor,

H. Ishiguro et al., “Development of facial-direction detection sensor,” Proc. 13th ITS World Congr. , pp. 1–8, 2006

work page 2006
[11]

Driver attention dealing with drowsiness and distraction,

A. Nabo, “Driver attention dealing with drowsiness and distraction,” Smart Eye Tech. Report , 2009

work page 2009
[12]

Driver inattention detection based on eye gaze-road event correlation,

L. Fletcher and A. Zelinsky, “Driver inattention detection based on eye gaze-road event correlation,” Int. J. Rob. Res. , vol. 28, no. 6, pp. 774–801, 2009

work page 2009
[13]

Smart eye,

“Smart eye,” http://www.smarteye.se

work page
[14]

General motors,

“General motors,” https://www.gm.com/

work page
[15]

Driver gaze tracking and eyes off the road detection system,

F. Vicente, Z. Huang, et al., “Driver gaze tracking and eyes off the road detection system,” IEEE Trans. on Intelligent Transportation Systems , vol. 16, no. 4, pp. 2014–2027, 2015

work page 2014
[16]

State farm distracted driver detection,

“State farm distracted driver detection,” https://www.kaggle.com/c/ state-farm-distracted-driver-detection

work page
[17]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv 1409.1556, 09 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[18]

Driver behavior recognition based on deep convolutional neural networks,

S. Yan, Y . Teng, J. Smith, and B. Zhang, “Driver behavior recognition based on deep convolutional neural networks,” pp. 636–641, 08 2016

work page 2016
[19]

Transfer learning of temporal information for driver action classiﬁcation,

J. Lemley, S. Bazrafkan, and P. Corcoran, “Transfer learning of temporal information for driver action classiﬁcation,”Modern Artiﬁcial Intelligence and Cognitive Science Conference (MAICS) , 2017

work page 2017
[20]

Distracted driver detection: Deep learning vs handcrafted features,

M. Hssayeni, S. Saxena, et al. , “Distracted driver detection: Deep learning vs handcrafted features,” Electronic Imaging, vol. 2017, pp. 20–26, 01 2017

work page 2017
[21]

Libsvm: A library for support vector machines,

C.-C. Chang and C.-J. Lin, “Libsvm: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, pp. 27:1– 27:27, 2011

work page 2011
[22]

Visual categorization with bags of keypoints,

G. Csurka, C. R. Dance, et al. , “Visual categorization with bags of keypoints,” In Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22, 2004

work page 2004
[23]

Histograms of oriented gradients for human detection,

N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, 2005

work page 2005
[24]

Distinctive image features from scale-invariant key- points,

D. G. Lowe, “Distinctive image features from scale-invariant key- points,” Int. Journal of Computer Vision , vol. 60, no. 2, pp. 91–110, 2004

work page 2004
[25]

Imagenet classiﬁca- tion with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁca- tion with deep convolutional neural networks,” Neural Information Processing Systems, vol. 25, 01 2012

work page 2012
[26]

Deep residual learning for image recogni- tion,

K. He, X. Zhang, et al. , “Deep residual learning for image recogni- tion,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016

work page 2016
[27]

High accuracy optical ﬂow estimation based on a theory for warping,

T. Brox, A. Bruhn, et al., “High accuracy optical ﬂow estimation based on a theory for warping,” ECCV, 2004

work page 2004
[28]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” Int. Conf. on Machine Learning, pp. 448–456, 2015

work page 2015
[29]

Imagenet large scale visual recogni- tion challenge,

O. Russakovsky, J. Deng, et al., “Imagenet large scale visual recogni- tion challenge,” Int. J. Comput. Vision , vol. 115, no. 3, pp. 211–252, 2015

work page 2015
[30]

Temporal segment networks: Towards good practices for dep action recognition,

L. Wang, Y . Xiong, Z. Wang, et al. , “Temporal segment networks: Towards good practices for dep action recognition,” European Conf. on Computer Vision (ECCV) , pp. 20–36, 2016

work page 2016
[31]

Automatic differentiation in pytorch,

A. Paszke, S. Gross, et al. , “Automatic differentiation in pytorch,” NIPS 2017 Workshop Autodiff , 2017

work page 2017
[32]

Brain4cars: Car that knows before you do via sensory-fusion deep learning architecture,

A. Jain, H. S. Koppula, et al., “Brain4cars: Car that knows before you do via sensory-fusion deep learning architecture,” 2016

work page 2016

[1] [1]

National highway trafﬁc safety administration trafﬁc safety facts,

“National highway trafﬁc safety administration trafﬁc safety facts,” https://www.nhtsa.gov/risky-driving/distracted-driving/

work page

[2] [2]

Seeing machines,

“Seeing machines,” http://www.seeingmachines.com

work page

[3] [3]

Valeo driver monitoring,

“Valeo driver monitoring,” https://www.valeo.com/en/ driver-monitoring/

work page

[4] [4]

Eyesight driver sense & cabin sense,

“Eyesight driver sense & cabin sense,” http://www.eyesight-tech.com/ automotive/

work page

[5] [5]

Sensedrive driver monitor system,

“Sensedrive driver monitor system,” https://www.sensetime.com/other/ 597

work page

[6] [6]

Detection of distracted driver using convolutional neural network,

B. Baheti, S. Gajre, and S. Talbar, “Detection of distracted driver using convolutional neural network,” The IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops , June 2018

work page 2018

[7] [7]

Real-time Distracted Driver Posture Classification

Y . Abouelnaga, H. M. Eraqi, and M. N. Moustafa, “Real-time distracted driver posture classiﬁcation,” CoRR, vol. abs/1706.09498, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Motion fused frames: Data level fusion strategy for hand gesture recognition,

O. K ¨op¨ukl¨u, N. K¨ose, and G. Rigoll, “Motion fused frames: Data level fusion strategy for hand gesture recognition,” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops , pp. 2216–2224, 2018

work page 2018

[9] [9]

Car that knows before you do: Anticipating maneuvers via learning temporal driving models,

A. Jain, H. S. Koppula, et al. , “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” IEEE Int. Conf. on Computer Vision (ICCV) , pp. 3182–3190, 2015

work page 2015

[10] [10]

Development of facial-direction detection sensor,

H. Ishiguro et al., “Development of facial-direction detection sensor,” Proc. 13th ITS World Congr. , pp. 1–8, 2006

work page 2006

[11] [11]

Driver attention dealing with drowsiness and distraction,

A. Nabo, “Driver attention dealing with drowsiness and distraction,” Smart Eye Tech. Report , 2009

work page 2009

[12] [12]

Driver inattention detection based on eye gaze-road event correlation,

L. Fletcher and A. Zelinsky, “Driver inattention detection based on eye gaze-road event correlation,” Int. J. Rob. Res. , vol. 28, no. 6, pp. 774–801, 2009

work page 2009

[13] [13]

Smart eye,

“Smart eye,” http://www.smarteye.se

work page

[14] [14]

General motors,

“General motors,” https://www.gm.com/

work page

[15] [15]

Driver gaze tracking and eyes off the road detection system,

F. Vicente, Z. Huang, et al., “Driver gaze tracking and eyes off the road detection system,” IEEE Trans. on Intelligent Transportation Systems , vol. 16, no. 4, pp. 2014–2027, 2015

work page 2014

[16] [16]

State farm distracted driver detection,

“State farm distracted driver detection,” https://www.kaggle.com/c/ state-farm-distracted-driver-detection

work page

[17] [17]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv 1409.1556, 09 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

Driver behavior recognition based on deep convolutional neural networks,

S. Yan, Y . Teng, J. Smith, and B. Zhang, “Driver behavior recognition based on deep convolutional neural networks,” pp. 636–641, 08 2016

work page 2016

[19] [19]

Transfer learning of temporal information for driver action classiﬁcation,

J. Lemley, S. Bazrafkan, and P. Corcoran, “Transfer learning of temporal information for driver action classiﬁcation,”Modern Artiﬁcial Intelligence and Cognitive Science Conference (MAICS) , 2017

work page 2017

[20] [20]

Distracted driver detection: Deep learning vs handcrafted features,

M. Hssayeni, S. Saxena, et al. , “Distracted driver detection: Deep learning vs handcrafted features,” Electronic Imaging, vol. 2017, pp. 20–26, 01 2017

work page 2017

[21] [21]

Libsvm: A library for support vector machines,

C.-C. Chang and C.-J. Lin, “Libsvm: A library for support vector machines,” ACM Trans. Intell. Syst. Technol. , vol. 2, no. 3, pp. 27:1– 27:27, 2011

work page 2011

[22] [22]

Visual categorization with bags of keypoints,

G. Csurka, C. R. Dance, et al. , “Visual categorization with bags of keypoints,” In Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22, 2004

work page 2004

[23] [23]

Histograms of oriented gradients for human detection,

N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, 2005

work page 2005

[24] [24]

Distinctive image features from scale-invariant key- points,

D. G. Lowe, “Distinctive image features from scale-invariant key- points,” Int. Journal of Computer Vision , vol. 60, no. 2, pp. 91–110, 2004

work page 2004

[25] [25]

Imagenet classiﬁca- tion with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁca- tion with deep convolutional neural networks,” Neural Information Processing Systems, vol. 25, 01 2012

work page 2012

[26] [26]

Deep residual learning for image recogni- tion,

K. He, X. Zhang, et al. , “Deep residual learning for image recogni- tion,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016

work page 2016

[27] [27]

High accuracy optical ﬂow estimation based on a theory for warping,

T. Brox, A. Bruhn, et al., “High accuracy optical ﬂow estimation based on a theory for warping,” ECCV, 2004

work page 2004

[28] [28]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” Int. Conf. on Machine Learning, pp. 448–456, 2015

work page 2015

[29] [29]

Imagenet large scale visual recogni- tion challenge,

O. Russakovsky, J. Deng, et al., “Imagenet large scale visual recogni- tion challenge,” Int. J. Comput. Vision , vol. 115, no. 3, pp. 211–252, 2015

work page 2015

[30] [30]

Temporal segment networks: Towards good practices for dep action recognition,

L. Wang, Y . Xiong, Z. Wang, et al. , “Temporal segment networks: Towards good practices for dep action recognition,” European Conf. on Computer Vision (ECCV) , pp. 20–36, 2016

work page 2016

[31] [31]

Automatic differentiation in pytorch,

A. Paszke, S. Gross, et al. , “Automatic differentiation in pytorch,” NIPS 2017 Workshop Autodiff , 2017

work page 2017

[32] [32]

Brain4cars: Car that knows before you do via sensory-fusion deep learning architecture,

A. Jain, H. S. Koppula, et al., “Brain4cars: Car that knows before you do via sensory-fusion deep learning architecture,” 2016

work page 2016