arxiv: 2604.19888 · v1 · submitted 2026-04-21 · 💻 cs.CV

SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

Pith reviewed 2026-05-10 03:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords driver gaze estimation · point-of-gaze prediction · scene grid attention · transformer attention · UD-FSG dataset · multi-modal fusion · traffic scene context · eye tracking

The pith

Integrating traffic scene images via grid attention reduces driver point-of-gaze error by 23.5 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the UD-FSG dataset of synchronized driver face and traffic scene images to supply contextual cues missing from face-only models. It presents the SGAP-Gaze network that first fuses face, eye, and iris features into a gaze intent vector and then applies a transformer attention mechanism over a spatial grid of the scene image to predict the point of gaze. The resulting model records a mean pixel error of 104.73 on UD-FSG and 63.48 on the LBW dataset. These figures represent a 23.5 percent improvement over prior driver gaze estimators and hold across all scene regions, including the outer areas that matter for situational awareness. The work shows that explicit scene context can make gaze prediction more reliable in real driving conditions.

Core claim

SGAP-Gaze integrates driver face, eye, iris, and scene contextual information. Facial modality features are fused into a gaze intent vector. Attention scores are then computed over the spatial scene grid with a Transformer-based mechanism that fuses face and scene features to produce the point-of-gaze. On the UD-FSG dataset this yields a mean pixel error of 104.73 and on the LBW dataset 63.48, a 23.5 percent reduction relative to state-of-the-art models. The model maintains lower error across all spatial ranges, including the outer regions of the scene.

What carries the argument

Scene grid attention, a Transformer-based mechanism that computes attention scores over a spatial grid of the scene image after fusing it with a facial gaze intent vector to produce the final point-of-gaze estimate.
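
A minimal sketch, in plain Python, of how attention over a scene grid could turn a gaze intent vector into a point of gaze. It assumes scaled dot-product scores between the intent vector (the query) and one feature vector per grid cell (the keys), and it assumes the PoG is read out as the attention-weighted mean of the cell centres. The function names, the 2×2 grid, the 1920×1080 image size, and the readout rule are illustrative assumptions, not details confirmed by the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grid_attention_pog(intent, cell_feats, grid_w, grid_h, img_w, img_h):
    """Attend over scene grid cells with the gaze intent vector as the query.

    intent     : list[float], fused face/eye/iris vector (query)
    cell_feats : grid_w * grid_h feature vectors in row-major order (keys)
    Returns (attention weights, (x, y) point of gaze in pixels).
    ASSUMPTION: the PoG is the attention-weighted mean of cell centres;
    the paper may use a different readout.
    """
    d = len(intent)
    scores = [sum(q * k for q, k in zip(intent, feat)) / math.sqrt(d)
              for feat in cell_feats]
    weights = softmax(scores)
    cell_w, cell_h = img_w / grid_w, img_h / grid_h
    x = y = 0.0
    for idx, w in enumerate(weights):
        row, col = divmod(idx, grid_w)
        x += w * (col + 0.5) * cell_w
        y += w * (row + 0.5) * cell_h
    return weights, (x, y)

# Toy input (hypothetical): 2x2 grid, 2-D features,
# intent vector aligned with the top-right cell.
intent = [4.0, 0.0]
cells = [[0.0, 1.0], [5.0, 0.0],   # row 0: left, right
         [0.0, 0.0], [0.0, 1.0]]   # row 1: left, right
weights, pog = grid_attention_pog(intent, cells, 2, 2, 1920, 1080)
```

In the toy input the intent vector matches only the top-right cell, so nearly all attention mass, and therefore the predicted point, falls in that quadrant.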

If this is right

  • Mean pixel error stays lower than prior methods across every spatial range of the scene.
  • Gains are largest in outer scene regions that occur infrequently yet are critical for driver attention.
  • The multi-modal fusion of face and scene information produces more robust estimates in real-world driving environments.
  • Performance improves on both the new UD-FSG dataset and the existing LBW dataset.
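
The per-range claim above can be checked by grouping errors according to where the ground-truth gaze falls in the scene. The sketch below assumes that "spatial range" means radial distance of the ground-truth point from the scene centre; the band edges, the centre, and the sample points are made up for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict

def error_by_radial_band(preds, truths, center, band_edges):
    """Mean pixel error grouped by ground-truth distance from `center`.

    band_edges: ascending radii in pixels; [200, 400] yields the bands
    [0, 200), [200, 400), [400, inf), keyed 0, 1, 2 in the result.
    Bands with no samples are omitted.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for (px, py), (tx, ty) in zip(preds, truths):
        radius = math.hypot(tx - center[0], ty - center[1])
        band = sum(radius >= edge for edge in band_edges)
        sums[band] += math.hypot(px - tx, py - ty)
        counts[band] += 1
    return {band: sums[band] / counts[band] for band in sorted(counts)}

# Made-up points, not data from the paper.
truths = [(960, 540), (1000, 540), (1500, 540), (100, 100)]
preds = [(963, 544), (1000, 550), (1530, 580), (160, 180)]
bands = error_by_radial_band(preds, truths, center=(960, 540),
                             band_edges=[200, 400])
```

Comparing these per-band means between two models shows whether a gain is concentrated near the centre or extends to the outer bands.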

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In-vehicle monitoring systems could adopt the same scene-grid attention to detect driver inattention more reliably and trigger earlier alerts.
  • The same attention pattern might be tested on non-driving gaze tasks such as human-robot interaction or video analysis.
  • Additional synchronized face-scene datasets collected under varied lighting or weather would provide a direct check on how much the contextual cues generalize.

Load-bearing premise

Scene images supply independent and useful contextual cues about surrounding traffic that the attention mechanism can reliably exploit to improve point-of-gaze predictions beyond facial features alone.

What would settle it

An ablation that removes the scene-grid attention module and shows no meaningful rise in error, or a test on a new driving dataset where face and scene cues are uncorrelated, would indicate whether the reported gains depend on the assumed contextual value of the scene images.

Figures

Figures reproduced from arXiv: 2604.19888 by Pavan Kumar Sharma, Pranamesh Chakraborty.

Figure 1. Overview of the data collection setup: (a) installation of face and scene cameras; (b) captured driver face and scene images.
    Adjacent paper text: 2) Participants: The present study requires driver face and gaze information in addition to the forward traffic scene to build the driver gaze estimation model. Since this study involves human participants (drivers), ethical approval was obtained from the Institute Ethics Committee be…
Figure 3. The traffic environment consisting of diverse dynamic…
Figure 3. Sample face–scene image pairs from the UD-FSG dataset, illustrating variations in traffic density and lighting conditions.
Figure 4. Visualization of driver gaze point distribution of UD-FSG dataset.
    Adjacent paper text: … appearance including face features, eye features, and iris location and scene context. Instead of directly regressing gaze coordinates, the proposed approach formulates gaze estimation as a problem over spatial scene regions. The proposed architecture consists of four major components: (1) Facial geometry detection module (2) Multi-stream f…
Figure 5. Proposed driver gaze estimation pipeline fusing face and scene information using transformer.
    Adjacent paper text: Problem Formulation: Given synchronized inputs of a facial image (F) and the corresponding scene image (S), the gaze location (gaze) is modeled as a function of the face features and scene features:
    $$\text{gaze} = f(F, S) \tag{1}$$
    Since facial image consists of multiple features at various scales, including eyes, facial feature extraction has been further divided into: (i) Multi-level face features and (ii) Left and ri…
    Adjacent paper text: … across layers, we project each feature map into a unified 256-D embedding space using a learnable 1 × 1 convolution:
    $$\hat{F}_l = \phi_l(F_l), \qquad \hat{F}_l \in \mathbb{R}^{256 \times H_l \times W_l} \tag{3}$$
    where $\phi_l(\cdot)$ represents the channel projection operation. Finally, to obtain a compact global representation, adaptive global average pooling (GAP) is applied o…
Figure 6. Visualization of face features extracted from different layers of ResNet-18 for three drivers.
    Adjacent paper text: 2) Gaussian-Weighted Eye Feature Extraction: Along with the overall face features, the eyes, together with the iris position, are extremely important for determining the gaze location of the participants. Therefore, we design an efficient feature extraction of the eye region, detected using our FEI model, along …
Figure 7. Visualization of eye features extracted from Layer-4 of ResNet-18 with constant padding value of 114, comparing unweighted and Gaussian-weighted feature responses emphasizing the iris region.
    Adjacent paper text: 3) Scene Feature Extraction: The scene image contains the contextual information of the driving environment, which influences the driver gaze. To extract this information, we employ a pretrained ResNet-18 backbone as…
Figure 8
    Adjacent paper text: Figure 8 shows a few samples of PoG predictions with varying error magnitudes. In the top sample, the model achieves high accuracy with a pixel error of 11 pixels, indicating that the predicted gaze location is very close to the ground truth. In contrast, the bottom-right sample shows a larger error of 133 pixels, where the predicted gaze location deviates significantly from the actual gaze point. These higher errors t…
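
Two feature-extraction steps quoted in the figure excerpts can be illustrated with small dependency-free Python functions: the learnable 1 × 1 channel projection followed by global average pooling, and the Gaussian weighting of eye features around the iris. This is a sketch under stated assumptions. The excerpts are truncated, so where pooling is applied, the value of sigma, and whether the Gaussian multiplies features or pixels are guesses; all function names and toy sizes are hypothetical (the paper projects to 256 channels, the example uses 3 for readability).

```python
import math

def conv1x1(feature_map, weight, bias):
    """1x1 convolution: a per-pixel linear map across channels.

    feature_map: [C_in][H][W], weight: [C_out][C_in], bias: [C_out].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[[bias[o] + sum(weight[o][c] * feature_map[c][i][j]
                            for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(weight))]

def global_average_pool(feature_map):
    """Average each channel over all spatial positions -> length-C vector."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in feature_map]

def gaussian_weight_map(height, width, center, sigma):
    """2-D Gaussian weights peaking (value 1) at `center` = (row, col)."""
    cy, cx = center
    return [[math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]

def apply_weights(feature_map, weights):
    """Multiply every channel of a [C][H][W] map by [H][W] weights."""
    return [[[v * w for v, w in zip(f_row, w_row)]
             for f_row, w_row in zip(channel, weights)]
            for channel in feature_map]

# Projection + pooling on a toy 2-channel, 2x2 feature map.
F_l = [[[1.0, 2.0], [3.0, 4.0]],
       [[0.0, 1.0], [1.0, 0.0]]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical projection weights
b = [0.0, 0.0, 0.5]
F_hat = conv1x1(F_l, W, b)
f_vec = global_average_pool(F_hat)

# Gaussian weighting of a toy 1-channel, 3x5 eye feature map,
# with the iris assumed at row 1, column 3 and sigma assumed to be 1.
eye_feats = [[[1.0] * 5 for _ in range(3)]]
weights = gaussian_weight_map(3, 5, center=(1, 3), sigma=1.0)
weighted = apply_weights(eye_feats, weights)
```

The projection changes only the channel count and leaves the spatial size untouched, which is what lets feature maps from different backbone layers share one embedding width before pooling. The Gaussian map leaves the response at the iris unchanged and attenuates responses farther away.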
Original abstract

Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which can help improve the gaze estimation model, along with the face images. We propose SGAP-Gaze, Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism fusing face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on LBW dataset, achieving a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.
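
For readers who want to reproduce the headline metric, here is a small sketch of mean pixel error and of the relative reduction behind a figure such as 23.5%. It assumes that pixel error is the Euclidean distance between predicted and ground-truth gaze points in scene-image coordinates, which the abstract does not spell out. The points and the baseline value are made up; the abstract does not state which baseline error the 23.5% is measured against.

```python
import math

def mean_pixel_error(preds, truths):
    """Average Euclidean distance, in pixels, between paired 2-D points."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("need two non-empty lists of equal length")
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(preds, truths)) / len(preds)

def relative_reduction(baseline_error, model_error):
    """Fractional drop in error relative to a baseline (0.235 means 23.5%)."""
    return (baseline_error - model_error) / baseline_error

# Made-up points: per-sample errors are 5 px and 10 px.
preds = [(103, 204), (400, 300)]
truths = [(100, 200), (406, 308)]
mpe = mean_pixel_error(preds, truths)

# Hypothetical errors chosen only to illustrate the formula.
drop = relative_reduction(200.0, 153.0)
```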

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Urban Driving-Face Scene Gaze (UD-FSG) dataset consisting of synchronized driver face and traffic scene images, and proposes the SGAP-Gaze network. Facial features from face, eye, and iris are fused into a gaze intent vector; a Transformer-based attention mechanism then computes scores over a spatial scene grid to fuse scene context and predict point-of-gaze (PoG). The model reports a mean pixel error of 104.73 on UD-FSG and 63.48 on the LBW dataset, corresponding to a 23.5% reduction relative to prior state-of-the-art driver gaze estimators, with additional claims of improved performance across spatial regions including outer scene areas.

Significance. If substantiated, the work would advance driver monitoring by showing that explicit scene context can improve PoG prediction beyond facial cues alone, with direct relevance to situational-awareness systems in autonomous driving. The paired face-scene dataset is a concrete contribution that enables future scene-aware gaze research. The cross-dataset evaluation on LBW and the spatial error distribution analysis are positive elements that strengthen the empirical case.

major comments (2)
  1. [Experiments] Experiments section: No ablation is reported that removes the scene-grid attention branch while retaining the multi-modal face/eye/iris fusion and Transformer components. Without this internal baseline, the 23.5% mean-pixel-error reduction cannot be attributed specifically to scene contextual cues rather than to the richer facial feature extractor or overall architecture.
  2. [Dataset and Methods] Dataset and Methods sections: The UD-FSG dataset description omits the total number of synchronized pairs, the train/test split ratios, and the procedure used to obtain PoG ground-truth labels. These details are required to assess whether the reported errors (104.73 on UD-FSG, 63.48 on LBW) are reproducible and statistically reliable.
minor comments (2)
  1. [Abstract] The abstract refers to a 'spatial pixel distribution analysis' but does not cite the corresponding figure or table that presents this analysis.
  2. [Method] Notation for the gaze intent vector and the scene-grid attention scores would be clearer if accompanied by explicit equations defining all variables and dimensions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These points highlight important aspects for strengthening the empirical claims and reproducibility of the work. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation is reported that removes the scene-grid attention branch while retaining the multi-modal face/eye/iris fusion and Transformer components. Without this internal baseline, the 23.5% mean-pixel-error reduction cannot be attributed specifically to scene contextual cues rather than to the richer facial feature extractor or overall architecture.

    Authors: We agree that the current experiments do not isolate the contribution of the scene-grid attention branch from the multi-modal facial fusion and Transformer components. To address this, we will add a dedicated ablation study in the revised Experiments section. This ablation will remove only the scene-grid attention mechanism while preserving the face/eye/iris feature extraction, fusion into the gaze intent vector, and Transformer-based processing, allowing direct comparison of mean pixel error with and without scene context. revision: yes

  2. Referee: [Dataset and Methods] Dataset and Methods sections: The UD-FSG dataset description omits the total number of synchronized pairs, the train/test split ratios, and the procedure used to obtain PoG ground-truth labels. These details are required to assess whether the reported errors (104.73 on UD-FSG, 63.48 on LBW) are reproducible and statistically reliable.

    Authors: We acknowledge the omission of these critical dataset details from the manuscript. In the revised version, we will expand the Dataset section to explicitly report the total number of synchronized driver face-scene image pairs in UD-FSG, the train/test split ratios used for model training and evaluation, and the precise procedure employed to obtain the point-of-gaze ground-truth labels. These additions will support reproducibility and allow readers to better evaluate the reliability of the reported errors. revision: yes
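
Both referee points (attributing the gain to scene context, and judging statistical reliability) could be examined with a percentile bootstrap over per-sample errors. The sketch below is one possible analysis, not the authors' procedure: it puts a confidence interval on a mean pixel error and on the paired difference between a hypothetical "no scene branch" ablation and the full model. All error values are made up for illustration.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Made-up per-sample pixel errors on the same test images.
errors_full = [12.0, 35.0, 62.0, 80.0, 94.0, 105.0, 120.0, 132.0, 150.0, 210.0]
errors_no_scene = [30.0, 52.0, 75.0, 98.0, 110.0, 130.0, 141.0, 160.0, 175.0, 240.0]

# Interval for the full model's mean pixel error.
full_low, full_high = bootstrap_mean_ci(errors_full)

# Paired differences: positive values mean the scene branch reduced the error.
diffs = [a - b for a, b in zip(errors_no_scene, errors_full)]
diff_low, diff_high = bootstrap_mean_ci(diffs)
```

If the interval on the paired differences excludes zero, the ablation supports attributing part of the gain to the scene branch; an interval that straddles zero would leave the attribution open.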

Circularity Check

0 steps flagged

No circularity: empirical results on held-out and external test sets

Full rationale

The paper introduces a new dataset (UD-FSG) and a neural network architecture (SGAP-Gaze) that fuses face/eye/iris features into a gaze intent vector then applies Transformer-based scene-grid attention. Performance is measured as mean pixel error on a held-out test split of UD-FSG and on the external LBW dataset, with comparisons only to external SOTA models. No equations, first-principles derivations, or parameter-fitting steps are presented that reduce the reported error reductions to the training inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims remain falsifiable via the reported test metrics and do not collapse into self-definition or renamed known results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that scene context adds independent signal and that the transformer attention can extract it; the model itself contains many learned parameters but no hand-tuned constants beyond standard training.

free parameters (1)
  • Neural network weights and attention parameters
    All weights are fitted during training on the UD-FSG dataset.
axioms (1)
  • domain assumption: Synchronized scene images contain traffic cues that are relevant to the driver's point of gaze
    Invoked when the model fuses scene features with the facial gaze intent vector.

pith-pipeline@v0.9.0 · 5605 in / 1342 out tokens · 47432 ms · 2026-05-10T03:28:40.086561+00:00 · methodology

