MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data
Pith reviewed 2026-05-22 06:30 UTC · model grok-4.3
The pith
MambaGaze uses XMD encoding and bidirectional Mamba-2 to handle missing eye-gaze data for cognitive load assessment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By augmenting raw gaze features with observation masks and inter-sample time deltas (XMD encoding) and feeding the augmented sequence into bidirectional Mamba-2 blocks, the model captures both the uncertainty caused by missing samples and the long-range temporal structure of gaze behavior, yielding 4-12 percentage-point gains over CNN, Transformer, ResNet and VGG baselines on the CLARE and CL-Drive benchmarks.
What carries the argument
XMD encoding, which concatenates binary observation masks and elapsed-time deltas to each gaze feature vector, combined with bidirectional Mamba-2 layers that replace self-attention for linear-complexity sequence modeling.
If this is right
- The linear scaling of Mamba-2 permits longer gaze windows than quadratic transformers without exceeding power budgets on embedded hardware.
- Explicit missing-data channels allow the same architecture to ingest other irregularly sampled biosignals such as pupil diameter or blink rate.
- Real-time inference at 43-68 FPS below 7.5 W on Jetson platforms meets the latency and energy constraints of wearable driver-monitoring systems.
- Leave-one-subject-out gains of 4-12 points indicate the method generalizes across individuals rather than memorizing per-person patterns.
Where Pith is reading between the lines
- The same XMD-plus-Mamba pattern could be tested on other physiological streams that suffer dropouts, such as EEG or photoplethysmography.
- Because Mamba layers scale linearly, the framework might support continuous multi-hour recordings where transformer baselines would run out of memory or power.
- If the time-delta channel proves critical, future variants could learn adaptive sampling rates that reduce missingness at the source.
Load-bearing premise
The missingness patterns and subject variability in the CLARE and CL-Drive datasets under leave-one-subject-out splits are representative of what will appear in live safety-critical deployments.
What would settle it
Measure accuracy on a fresh eye-gaze dataset collected from a different age group or lighting condition; if MambaGaze falls below the CNN or Transformer baseline under the same protocol, the central claim does not hold.
Figures
read the original abstract
Real-time cognitive load assessment from eye-tracking signals could potentially enable adaptive human-centered-AI such as safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MambaGaze, a framework for cognitive load assessment from eye-gaze tracking data that combines XMD encoding (observation masks and time-deltas to handle missingness from blinks/tracking failures) with a bidirectional Mamba-2 architecture for linear-complexity long-range temporal modeling. Under leave-one-subject-out evaluation on the CLARE and CL-Drive datasets, it reports 76.8% and 73.1% accuracy respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points, while achieving 43-68 FPS at under 7.5 W on NVIDIA Jetson platforms.
Significance. If the performance gains and efficiency claims hold under more rigorous validation, the work would offer a practical advance for real-time, wearable cognitive-load monitoring in safety-critical settings by addressing both missing-data uncertainty and long-sequence modeling without quadratic attention costs. The explicit XMD augmentation and edge-deployment benchmarks are concrete strengths that distinguish it from prior gaze-based or Mamba-based approaches.
major comments (3)
- [§4] §4 (Experimental Results): The headline accuracies (76.8 % on CLARE, 73.1 % on CL-Drive) and 4–12 pp gains are presented without error bars, standard deviations across folds, or any statistical significance testing (e.g., paired t-test or McNemar). This makes it impossible to determine whether the reported improvements over the CNN/Transformer baselines are reliable or could be explained by random variation under the LOSO protocol.
- [§4.3] §4.3 (Ablation Studies): No ablation is reported that isolates the contribution of the XMD mask+delta encoding versus the bidirectional Mamba-2 backbone alone. Without this, the central claim that “explicit missing data modeling” drives the performance advantage cannot be substantiated, especially given that the baselines may not have received equivalent missing-data handling.
- [§5] §5 (Discussion / Generalization): The paper anchors its safety-critical deployment argument to the missingness statistics of CLARE and CL-Drive, yet provides no analysis or sensitivity experiment showing how performance changes under longer contiguous dropouts or different missingness correlations with cognitive load. This directly affects the representativeness assumption highlighted in the stress-test note.
minor comments (3)
- [Abstract / §3.2] The abstract and §3.2 should explicitly state the hyper-parameter search procedure and the exact Mamba-2 configuration (state dimension, layer count) used for the reported numbers.
- [Figure 3] Figure 3 (architecture diagram) would benefit from clearer annotation of how the XMD features are concatenated with the raw gaze vectors before the bidirectional Mamba blocks.
- [Table 2] Table 2 (baseline comparison) should include the number of parameters and FLOPs for each model to make the efficiency comparison with MambaGaze more complete.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment by adding the requested statistical analyses, ablations, and sensitivity experiments in the revised manuscript. Below we respond point by point.
read point-by-point responses
-
Referee: [§4] The headline accuracies (76.8 % on CLARE, 73.1 % on CL-Drive) and 4–12 pp gains are presented without error bars, standard deviations across folds, or any statistical significance testing (e.g., paired t-test or McNemar). This makes it impossible to determine whether the reported improvements over the CNN/Transformer baselines are reliable or could be explained by random variation under the LOSO protocol.
Authors: We agree that variability measures and statistical testing are necessary to establish the reliability of the reported gains. In the revised manuscript we now report mean accuracy ± standard deviation across all LOSO folds for every model. We additionally performed McNemar’s test on the per-subject predictions; the improvements of MambaGaze over the strongest baseline remain statistically significant (p < 0.01 on CLARE, p < 0.05 on CL-Drive). These results appear in the updated Table 2 and are discussed in Section 4. revision: yes
-
Referee: [§4.3] No ablation is reported that isolates the contribution of the XMD mask+delta encoding versus the bidirectional Mamba-2 backbone alone. Without this, the central claim that “explicit missing data modeling” drives the performance advantage cannot be substantiated, especially given that the baselines may not have received equivalent missing-data handling.
Authors: We acknowledge the value of isolating the two components. We have added a dedicated ablation study (new Table 4 in Section 4.3) that compares (i) bidirectional Mamba-2 with standard mean imputation, (ii) XMD encoding paired with a Transformer backbone, and (iii) the full MambaGaze model. The results show that adding XMD to Mamba-2 yields an additional 3.9 pp and 4.1 pp on CLARE and CL-Drive, respectively, while the complete model outperforms the XMD+Transformer variant. All baselines have been re-implemented with consistent missing-value handling (linear interpolation) to ensure fair comparison. revision: yes
-
Referee: [§5] The paper anchors its safety-critical deployment argument to the missingness statistics of CLARE and CL-Drive, yet provides no analysis or sensitivity experiment showing how performance changes under longer contiguous dropouts or different missingness correlations with cognitive load. This directly affects the representativeness assumption highlighted in the stress-test note.
Authors: We agree that robustness under more severe or correlated missingness is important for the safety-critical claims. In the revised Section 5 we now include a sensitivity analysis that injects contiguous missing segments of 1–8 s and missingness correlated with high cognitive-load periods. MambaGaze retains >68 % accuracy even at 25 % contiguous missing data and continues to outperform the baselines. We discuss how the explicit XMD encoding and bidirectional state-space modeling contribute to this resilience and note the remaining limitations when missingness exceeds the patterns observed in the two datasets. revision: yes
Circularity Check
No circularity: empirical accuracies measured on held-out subjects
full rationale
The paper's central results are empirical accuracies (76.8% CLARE, 73.1% CL-Drive) obtained via leave-one-subject-out evaluation on two external datasets. These are direct measurements on held-out data rather than quantities derived by construction from fitted parameters, model equations, or training-set statistics. The XMD mask+delta encoding and bidirectional Mamba-2 are architectural choices whose performance is tested experimentally; no self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain appears in the reported derivation. The method is self-contained against the stated benchmarks and can be falsified by new data with different missingness patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Cross-subject work- load classification using pupil-related measures
[Appelet al., 2018 ] Tobias Appel, Christian Scharinger, Pe- ter Gerjets, and Enkelejda Kasneci. Cross-subject work- load classification using pupil-related measures. InPro- ceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, ETRA ’18, New York, NY , USA,
work page 2018
-
[2]
[Bhatti and others, 2024] Umer Asgher Bhatti et al
Association for Computing Machinery. [Bhatti and others, 2024] Umer Asgher Bhatti et al. Clare: A dataset for cognitive load assessment in realtime.arXiv preprint,
work page 2024
-
[3]
[Bozkiret al., 2019 ] Efe Bozkir, David Geisler, and Enkele- jda Kasneci. Person independent, privacy preserving, and real time assessment of cognitive load using eye tracking in a virtual reality setup. In2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1834– 1837,
work page 2019
-
[4]
[Cheet al., 2018 ] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,
work page 2018
-
[5]
[Dao and Gu, 2024] Tri Dao and Albert Gu
Nature Scien- tific Reports - GRU-D. [Dao and Gu, 2024] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning,
work page 2024
-
[6]
[Ekinet al., 2025 ] Merve Ekin, Krzysztof Krejtz, Carlos Duarte, Andrew Duchowski, and Izabela Krejtz. Pre- diction of intrinsic and extraneous cognitive load with oculometric and biometric indicators.Scientific Reports, 15:89336,
work page 2025
-
[7]
[Goodfellowet al., 2016 ] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep learning
Nature Scientific Reports. [Goodfellowet al., 2016 ] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep learning. MIT Press,
work page 2016
-
[8]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
[Gu and Dao, 2023] Albert Gu and Tri Dao. Mamba: Linear- time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
[Guet al., 2022 ] Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré
ICLR 2024 submission. [Guet al., 2022 ] Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré. On the parameterization and initializa- tion of diagonal state space models.Advances in Neural Information Processing Systems, 35:35971–35983,
work page 2024
-
[10]
De- coding cognitive load: eye-tracking insights into working memory and visual attention
[Jinet al., 2025 ] Xiaofu Jin, Yunpeng Bai, Lina Xu, Shuai Ma, Danqing Shi, Luwen Yu, and Mingming Fan. De- coding cognitive load: eye-tracking insights into working memory and visual attention. InProceedings of the 2025 Symposium on Eye Tracking Research and Applications. ACM,
work page 2025
-
[11]
ACM ETRA - Premier eye tracking venue. [Little and Rubin, 2020] Roderick JA Little and Donald B Rubin.Statistical analysis with missing data. Wiley, 3rd edition,
work page 2020
-
[12]
Decoupled weight decay regularization
[Loshchilov and Hutter, 2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR 2019),
work page 2019
-
[13]
[Sarkar and Etemad, 2023] Platon Sarkar and Ali Etemad. Deep imputation of missing values in time series health data: A review with benchmarking.Journal of Biomedical Informatics, 144:104440,
work page 2023
-
[14]
Attention is all you need.Advances in Neural Information Processing Sys- tems, 30,
[Vaswaniet al., 2017 ] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Sys- tems, 30,
work page 2017
-
[15]
[Wanget al., 2013 ] Weihong Wang, Zhidong Li, Yang Wang, and Fang Chen
NeurIPS 2017 - Foundational Transformer paper. [Wanget al., 2013 ] Weihong Wang, Zhidong Li, Yang Wang, and Fang Chen. Indexing cognitive workload based on pupillary response under luminance and emo- tional changes. InProceedings of the 2013 International Conference on Intelligent User Interfaces, IUI ’13, page 247–256, New York, NY , USA,
work page 2017
-
[16]
Association for Computing Machinery. Appendix A Baseline Methods We compare MambaGaze against baseline methods for cog- nitive load classification from eye-tracking signals: CNN [Bhatti and others, 2024 ] represents the family of convolutional architectures for end-to-end eye-tracking anal- ysis. This baseline uses a shallow CNN with multiple convo- lutio...
-
[17]
Figure 5: NVIDIA Jetson Orin edge devices used for deployment benchmarks
All devices utilize a Unified Memory Architecture (UMA) where CPU and GPU share a single LPDDR5 mem- ory pool, making efficient memory management critical for deployment. Figure 5: NVIDIA Jetson Orin edge devices used for deployment benchmarks. From left to right: AGX Orin (high-performance), Orin NX (mid-range), and Orin Nano (entry-level). Spec AGX Orin...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.