HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
Pith reviewed 2026-05-10 17:15 UTC · model grok-4.3
The pith
HST-HGN fuses hierarchical hypergraphs with bidirectional state space models to assess driver fatigue from untrimmed videos efficiently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HST-HGN introduces a heterogeneous spatial-temporal hypergraph network that dynamically fuses pose-disentangled geometric topologies with multi-modal texture patches to model high-order facial synergies, paired with a Bi-Mamba module for bidirectional linear-complexity temporal filtering. This enables distinguishing ambiguous transient actions across their complete physiological lifecycles in untrimmed videos, achieving state-of-the-art performance with computational efficiency suitable for real-time in-cabin edge deployment.
What carries the argument
Hierarchical hypergraph fusion of pose-disentangled geometries and multi-modal texture patches together with Bi-Mamba bidirectional state space modeling, which jointly handles high-order spatial deformations and global temporal evolution.
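The hypergraph half of this machinery can be sketched with the standard spectral hypergraph convolution of Feng et al. [12], which the paper builds on. The incidence matrix, landmark features, and "eye region"/"mouth region" hyperedge grouping below are illustrative assumptions, not the paper's actual construction:

```python
import numpy as np

def hypergraph_conv(X, H, Theta, edge_w=None):
    """One HGNN-style layer: X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta.

    X: (N, F) node features, H: (N, E) incidence matrix,
    Theta: (F, F_out) weight matrix, edge_w: (E,) hyperedge weights.
    """
    N, E = H.shape
    W = np.ones(E) if edge_w is None else edge_w
    Dv = (H * W).sum(axis=1)            # node degrees
    De = H.sum(axis=0)                  # hyperedge degrees
    Dv_inv = np.diag(1.0 / np.sqrt(Dv))
    De_inv = np.diag(1.0 / De)
    A = Dv_inv @ H @ np.diag(W) @ De_inv @ H.T @ Dv_inv
    return np.maximum(A @ X @ Theta, 0.0)   # ReLU

# Toy example: 6 facial landmarks, 2 hyperedges (hypothetical
# "eye region" and "mouth region" groupings).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
H = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
Theta = rng.normal(size=(4, 8))
out = hypergraph_conv(X, H, Theta)
print(out.shape)  # (6, 8)
```

Because a hyperedge connects a whole landmark group at once, one layer already mixes information across an entire facial region, which is the "high-order synergy" a pairwise graph edge cannot express.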
If this is right
- Distinguishes yawning from speaking by encompassing complete action lifecycles rather than isolated frames.
- Achieves state-of-the-art results across diverse fatigue benchmarks while maintaining linear temporal complexity.
- Enables real-time in-cabin edge deployment by balancing discriminative power and computational efficiency.
- Overcomes the modeling limits of both heavy architectures and traditional pairwise graph networks.
Where Pith is reading between the lines
- The same spatial-temporal fusion could extend to other long-duration subtle action tasks such as micro-expression or posture analysis.
- Linear complexity opens the possibility of scaling to hour-long untrimmed recordings without quadratic cost growth.
- Pairing the model with vehicle telemetry signals might further reduce false positives in real driving conditions.
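The scaling point in the second bullet can be made concrete: self-attention scores every frame against every other frame (quadratic in sequence length), while a state-space scan updates one state per frame (linear). A back-of-envelope count with illustrative frame rates:

```python
def attention_pairs(T):
    # Pairwise score matrix a transformer computes per head and layer.
    return T * T

def ssm_steps(T):
    # Recurrent scan updates: one state update per frame (per direction).
    return T

for minutes in (1, 10, 60):
    T = minutes * 60 * 30          # frames at an assumed 30 fps
    print(f"{minutes:3d} min: attention/SSM work ratio = "
          f"{attention_pairs(T) // ssm_steps(T)}")
```

The ratio equals T itself, so an hour-long recording costs a scan roughly 100,000x less relative work than full attention, which is why linear complexity matters for untrimmed video.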
Load-bearing premise
The fusion of pose-disentangled geometric topologies with multi-modal texture patches in a hierarchical hypergraph, combined with bidirectional Mamba filtering, can reliably distinguish ambiguous transient actions like yawning versus speaking across their complete physiological lifecycles.
What would settle it
A controlled test on untrimmed videos containing extended yawning and speaking sequences: if accuracy gains over standard graph or transformer baselines disappear or reverse while compute cost remains higher, the central claim fails.
Original abstract
It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.
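The Bi-Mamba idea in the abstract, linear-time bidirectional filtering, can be caricatured with a diagonal linear recurrence run once forward and once backward over the sequence. The real selective scan (Gu and Dao [15]) makes the recurrence parameters input-dependent; this toy keeps them fixed, and all shapes and constants below are illustrative:

```python
import numpy as np

def linear_scan(x, a, b, c):
    """Diagonal linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    x: (T, D); a, b, c: (D,) per-channel parameters. O(T) time."""
    h = np.zeros(x.shape[1])
    ys = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        ys[t] = c * h
    return ys

def bi_scan(x, a, b, c):
    """Bidirectional variant: a forward scan plus a backward scan,
    summed -- a crude stand-in for Bi-Mamba's two-direction fusion."""
    fwd = linear_scan(x, a, b, c)
    bwd = linear_scan(x[::-1], a, b, c)[::-1]
    return fwd + bwd

T, D = 300, 8                      # 10 s of 30 fps features, 8 channels
rng = np.random.default_rng(1)
x = rng.normal(size=(T, D))
a = np.full(D, 0.9)                # decay < 1: long but finite memory
y = bi_scan(x, a, np.ones(D), np.ones(D))
print(y.shape)  # (300, 8)
```

The backward pass is what lets an early frame "see" a later one, so a mouth opening can be read in the context of how the action ends, the lifecycle argument the abstract makes for separating yawning from speaking.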
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HST-HGN, a Heterogeneous Spatial-Temporal Hypergraph Network with Bidirectional State Space Models (Bi-Mamba) for driver fatigue assessment from untrimmed videos. Spatially, a hierarchical hypergraph fuses pose-disentangled geometric topologies with multi-modal texture patches to capture high-order facial synergies; temporally, Bi-Mamba provides linear-complexity bidirectional filtering to model complete physiological lifecycles and distinguish ambiguous actions such as yawning versus speaking. The central claim is that this architecture achieves state-of-the-art performance while balancing discriminative power and efficiency, making it suitable for real-time in-cabin edge deployment.
Significance. If the experimental claims are substantiated, the work could meaningfully advance real-time fatigue monitoring by offering a more expressive spatial model than pairwise graphs and a more efficient temporal model than transformers, with potential impact on automotive safety systems that require both accuracy on subtle cues and low latency on edge hardware.
major comments (2)
- [Abstract] Abstract: The assertions of state-of-the-art performance across diverse fatigue benchmarks and suitability for real-time edge deployment rest entirely on unshown quantitative results; no accuracy metrics, baseline comparisons, ablation tables, error bars, or dataset details are supplied, rendering the central empirical claim unverifiable from the manuscript.
- [Abstract] Abstract and §4 (presumed experiments): No end-to-end latency, FPS, or memory measurements on representative edge hardware (e.g., Jetson) are reported, nor are ablations that isolate the contribution of the hierarchical hypergraph versus Bi-Mamba; without these, the efficiency half of the claim cannot be evaluated against prior graph or transformer baselines.
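The measurement the second major comment asks for is straightforward to specify. A minimal sketch of a per-frame latency/FPS harness, where `infer` is a placeholder for the deployed model's forward pass on the target device, not the paper's actual benchmark code:

```python
import time
import statistics

def benchmark(infer, frame, warmup=10, iters=100):
    """Median per-call latency and implied FPS for a callable `infer`.
    Warmup runs are discarded to avoid cold-start effects."""
    for _ in range(warmup):
        infer(frame)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer(frame)
        times.append(time.perf_counter() - t0)
    lat = statistics.median(times)
    return lat, 1.0 / lat

# Stand-in workload; on Jetson-class hardware this would wrap the
# model's forward pass on a preprocessed frame.
lat, fps = benchmark(lambda f: sum(f), list(range(10_000)))
print(f"median latency {lat * 1e3:.3f} ms, {fps:.0f} FPS")
```

Reporting the median (plus a high percentile) rather than the mean avoids letting occasional scheduler stalls dominate the headline number.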
minor comments (1)
- [Abstract] Abstract: The phrase 'global fatigue assessment' is introduced without a precise operational definition distinguishing it from local action classification, which could be clarified for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each point below and will revise the manuscript to improve the accessibility and completeness of the empirical claims.
Point-by-point responses
-
Referee: [Abstract] Abstract: The assertions of state-of-the-art performance across diverse fatigue benchmarks and suitability for real-time edge deployment rest entirely on unshown quantitative results; no accuracy metrics, baseline comparisons, ablation tables, error bars, or dataset details are supplied, rendering the central empirical claim unverifiable from the manuscript.
Authors: The full manuscript presents all requested quantitative details in Section 4, including accuracy metrics, baseline comparisons, ablation tables with error bars, and dataset specifications across multiple fatigue benchmarks. The abstract summarizes these findings at a high level due to length constraints. To make the central claims directly verifiable, we will revise the abstract to include key numerical results such as the top accuracy scores and efficiency gains. revision: yes
-
Referee: [Abstract] Abstract and §4 (presumed experiments): No end-to-end latency, FPS, or memory measurements on representative edge hardware (e.g., Jetson) are reported, nor are ablations that isolate the contribution of the hierarchical hypergraph versus Bi-Mamba; without these, the efficiency half of the claim cannot be evaluated against prior graph or transformer baselines.
Authors: Section 4 of the manuscript reports computational complexity and efficiency metrics along with initial ablation studies. We agree that explicit edge-device measurements and isolated component ablations are important for substantiating the efficiency claims. We will revise the manuscript to add end-to-end latency, FPS, and memory results on Jetson hardware and expand the ablations to separately quantify the hierarchical hypergraph and Bi-Mamba contributions relative to graph and transformer baselines. revision: yes
Circularity Check
No circularity: architecture proposal validated by external benchmarks
Full rationale
The paper proposes HST-HGN as a novel combination of hierarchical hypergraph fusion (pose-disentangled geometries + multi-modal texture patches) and Bi-Mamba bidirectional sequence modeling for fatigue assessment in untrimmed videos. All central claims of SOTA performance, discriminative power for ambiguous actions, and real-time edge suitability are grounded in extensive evaluations on diverse fatigue benchmarks rather than any internal derivation that reduces to fitted inputs or self-referential definitions. No equations, self-citations as uniqueness theorems, or ansatzes are presented that would create circularity; the model design is justified by its stated capacity to capture high-order synergies and linear-complexity temporal context, with empirical results providing independent external support.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters
axioms (1)
- domain assumption: Facial expressions in untrimmed videos contain sufficient high-order synergistic information to assess global fatigue states.
Reference graph
Works this paper leans on
-
[1]
Samy Abd El-Nabi, Walid El-Shafai, El-Sayed M. El-Rabaie, Khalil F. Ramadan, Fathi E. Abd El-Samie, and Saeed Mohsen. 2024. Machine learning and deep learning techniques for driver fatigue and drowsiness detection: a review. Multimedia Tools and Applications 83, 3 (2024), 9441–9477
2024
-
[2]
Shabnam Abtahi, Mona Omidyeganeh, Shervin Shirmohammadi, and Behnoosh Hariri. 2020. YawDD: Yawning Detection Dataset
2020
-
[3]
Safwan Mahmood Al-Selwi, Mohd Fadzil Hassan, Said Jadid Abdulkadir, Amgad Muneer, et al. 2023. LSTM inefficiency in long-term dependencies regression problems. Journal of Advanced Research in Applied Sciences and Engineering Technology 30, 3 (2023), 16–31
2023
-
[4]
Xiaopeng An, Lu Su, Qi Yang, Bo Shen, Linhua Gan, Jia jun Ji, Jian Wang, and Haifeng Su. 2025. A spatiotemporal hypergraph self-attention neural networks framework for the identification and pharmacological efficacy assessment of Parkinson's disease motor symptoms. NPJ Parkinson's Disease 11 (2025)
2025
-
[5]
Jing Bai, Wentao Yu, Zhu Xiao, Vincent Havyarimana, Amelia C. Regan, Hongbo Jiang, and Licheng Jiao. 2022. Two-Stream Spatial–Temporal Graph Convolutional Networks for Driver Drowsiness Detection. IEEE Transactions on Cybernetics 52, 12 (2022), 13821–13833
2022
-
[6]
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 813–824
2021
-
[7]
Shengli Cao, Peihua Feng, Wei Kang, Zeyi Chen, and Bo Wang. 2025. Optimized driver fatigue detection method using multimodal neural networks. Scientific Reports 15, 1 (2025), 12240
2025
-
[8]
Joao Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4724–4733
2017
-
[9]
Shuxiang Fa, Xiaohui Yang, Shiyuan Han, Zhiquan Feng, and Yuehui Chen. 2023. Multi-scale spatial–temporal attention graph convolutional networks for driver fatigue detection. Journal of Visual Communication and Image Representation 93 (2023), 103826
2023
-
[10]
Zunguan Fan, Yifan Feng, Kang Wang, and Xiaoli Li. 2024. Multi-Modal Temporal Hypergraph Neural Network for Flotation Condition Recognition. Entropy 26, 3 (2024)
2024
-
[11]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast Networks for Video Recognition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 6201–6210
2019
-
[12]
Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. 2019. Hypergraph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 3558–3565
2019
-
[13]
Biying Fu, Fadi Boutros, Chin-Teng Lin, and Naser Damer. 2024. A Survey on Drowsiness Detection: Modern Applications and Methods. IEEE Transactions on Intelligent Vehicles 9, 11 (2024), 7279–7300
2024
- [14]
-
[15]
Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 [cs.LG]
2024
-
[16]
Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv abs/2111.00396 (2021)
2021
-
[17]
Qing Han, Shimiao Cui, Weidong Min, Cong Yan, Li Liu, Feng Ning, and Li Li. 2025. A dense multi-pooling convolutional network for driving fatigue detection. Scientific Reports 15, 1 (2025), 15518
2025
-
[19]
Osama F. Hassan, Ahmed F. Ibrahim, Ahmed Gomaa, M. A. Makhlouf, and Bassel Hafiz. 2025. Real-time driver drowsiness detection using transformer architectures: a novel deep learning approach. Scientific Reports 15, 1 (2025), 17493
2025
-
[20]
Rui Huang, Yan Wang, Zijian Li, Zeyu Lei, and Yufan Xu. 2022. RF-DCM: Multi-Granularity Deep Convolutional Model Based on Feature Recalibration and Fusion for Driver Fatigue Detection. IEEE Transactions on Intelligent Transportation Systems 23, 1 (2022), 630–640
2022
-
[21]
Md Mohaiminul Islam and Gedas Bertasius. 2022. Long Movie Clip Classification with State-Space Video Models. Springer-Verlag, Berlin, Heidelberg, 87–104
2022
-
[22]
Fan Jiang, Qionghao Huang, Xiaoyong Mei, Quanlong Guan, Yaxin Tu, Weiqi Luo, and Changqin Huang. 2023. Face2Nodes: Learning facial expression representations with relation-aware dynamic graph convolution networks. Information Sciences 649 (2023), 119640
2023
-
[23]
Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. J. Mach. Learn. Res. 10 (Dec. 2009), 1755–1758
2009
-
[24]
Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. 2024. VideoMamba: State Space Model for Efficient Video Understanding. In European Conference on Computer Vision. Springer, 237–255
2024
-
[25]
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. 2024. VMamba: Visual State Space Model. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 103031–103063
2024
-
[26]
Yansha Lu, Chunsheng Liu, Faliang Chang, Hui Liu, and Hengqiang Huan. 2023. JHPFA-Net: Joint Head Pose and Facial Action Network for Driver Yawning Detection Across Arbitrary Poses in Videos. IEEE Transactions on Intelligent Transportation Systems 24, 11 (2023), 11850–11863
2023
-
[27]
Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. 2019. MediaPipe: A Framework for Building Perception Pipelines. arXiv:1906.08172 [cs.DC]
2019
- [28]
-
[29]
Abid Ali Minhas, Sohail Jabbar, Muhammad Farhan, and Muhammad Najam ul Islam. 2022. A smart analysis of driver fatigue and drowsiness detection using convolutional neural networks. Multimedia Tools and Applications 81, 19 (2022), 26969–26986
2022
-
[30]
Luntian Mou, Chao Zhou, Pengtao Xie, Pengfei Zhao, Ramesh Jain, Wen Gao, and Baocai Yin. 2023. Isotropic Self-Supervised Learning for Driver Drowsiness Detection With Attention-Based Multimodal Fusion. IEEE Transactions on Multimedia 25 (2023), 529–542
2023
-
[31]
Juan Diego Ortega, Neslihan Kose, Paola Cañas, Min-An Chao, Alexander Unnervik, Marcos Nieto, Oihana Otaegui, and Luis Salgado. 2020. DMD: A Large-Scale Multi-modal Driver Monitoring Dataset for Attention and Alertness Analysis. In Computer Vision – ECCV 2020 Workshops, Adrien Bartoli and Andrea Fusiello (Eds.). Springer International Publishing, Cham...
2020
-
[32]
Jing Ren, Suyu Ma, Hong Jia, Xiwei Xu, Ivan Lee, Haytham Fayek, Xiaodong Li, and Feng Xia. 2025. LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 8059–8066
2025
-
[33]
Shaibal Saha and Lanyu Xu. 2025. Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies. Neurocomputing 643 (2025), 130417
2025
-
[34]
Gulbadan Sikander and Shahzad Anwar. 2019. Driver Fatigue Detection Systems: A Review. IEEE Transactions on Intelligent Transportation Systems 20, 6 (2019), 2339–2352
2019
-
[35]
Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, and Subasish Das. 2025. From S4 to Mamba: A Comprehensive Survey on Structured State Space Models. arXiv:2503.18970 [cs.LG]
2025
-
[36]
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621 [cs.CL]
2023
-
[37]
Zhichao Sun, Yinan Miao, Jun Young Jeon, Yeseul Kong, and Gyuhae Park. 2023. Facial feature fusion convolutional neural network for driver fatigue detection. Engineering Applications of Artificial Intelligence 126 (2023), 106981
2023
-
[38]
Yujin Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, and Junwei Liang. 2024. VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2024), 5663–5673
2024
-
[40]
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 10078–10093
2022
-
[41]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497
2015
-
[43]
Jue Wang, Wenjie Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. 2023. Selective Structured State-Spaces for Long-Form Video Understanding. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 6387–6397
2023
-
[44]
Weizheng Wang, Le Mao, Baijian Yang, Guohua Chen, and Byung-Cheol Min
- [45]
-
[46]
Yi Wang, Haoran Luo, Luyang Meng, and Yuying Fan. 2026. MST-HGCN: A multimodal spatio-temporal hypergraph convolutional network for infantile spasms detection. Journal of King Saud University Computer and Information Sciences (2026)
2026
-
[47]
Jasper S. Wijnands, Jason Thompson, Gideon D. A. Aschwanden, and Mark Stevenson. 2020. Real-time monitoring of driver drowsiness on mobile platforms using 3D neural networks. Neural Computing and Applications 32, 13 (2020), 9731–9743
2020
-
[48]
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. 2019. Long-Term Feature Banks for Detailed Video Understanding. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 284–293
2019
-
[49]
Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 22419–22430
2021
-
[50]
Zhize Wu, Yue Ding, Long Wan, Teng Li, and Fudong Nian. 2025. Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition. Pattern Recognition 159 (2025), 111106
2025
-
[51]
Cuiliu Yang and Zhao Pei. 2023. Long-Short Term Spatio-Temporal Aggregation for Trajectory Prediction. IEEE Transactions on Intelligent Transportation Systems 24, 4 (2023), 4114–4126
2023
-
[52]
Cong Yang, Zhenyu Yang, Weiyu Li, and John See. 2023. FatigueView: A Multi-Camera Video Dataset for Vision-based Drowsiness Detection. IEEE Transactions on Intelligent Transportation Systems 24, 1 (2023), 233–246
2023
-
[53]
Lie Yang, Haohan Yang, Henglai Wei, Zhongxu Hu, and Chen Lv. 2024. Video-Based Driver Drowsiness Detection With Optimised Utilization of Key Facial Features. IEEE Transactions on Intelligent Transportation Systems 25, 7 (2024), 6938–6950
2024
-
[54]
Zhimin Zhang, Hongmei Wang, Qian You, Liming Chen, and Huansheng Ning. 2024. A novel temporal adaptive fuzzy neural network for facial feature based fatigue assessment. Expert Systems with Applications 252 (2024), 124124
2024
-
[56]
Xia Zhao, Limin Wang, Yufei Zhang, Xuming Han, Muhammet Deveci, and Milan Parmar. 2024. A review of convolutional neural networks in computer vision. Artificial Intelligence Review 57 (2024), 99
2024
-
[57]
Zuopeng Zhao, Nana Zhou, Lan Zhang, Hualin Yan, Yi Xu, and Zhongxin Zhang. 2020. Driver Fatigue Detection Based on Convolutional Neural Networks Using EM-CNN. Computational Intelligence and Neuroscience 2020, 1 (2020), 7251280
2020
-
[59]
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 11106–11115
2021