pith. sign in

arxiv: 2605.22775 · v1 · pith:4I5WDYLJnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.HC

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

Pith reviewed 2026-05-22 06:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.HC
keywords cognitive load assessmenteye-gaze trackingMambamissing data modelingtime-series classificationreal-time inferenceedge AIdriver monitoring
0
0 comments X

The pith

MambaGaze uses XMD encoding and bidirectional Mamba-2 to handle missing eye-gaze data for cognitive load assessment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that eye-tracking signals can support real-time cognitive load monitoring in safety-critical settings once two obstacles are cleared: frequent gaps from blinks or tracker loss, and the cost of modeling long sequences. It proposes to fix the first by adding explicit masks and time deltas to each feature, and the second by replacing quadratic attention with linear bidirectional Mamba-2 layers. On two public datasets the resulting model reaches 76.8 % and 73.1 % accuracy under leave-one-subject-out testing while staying under 7.5 W and above 43 FPS on Jetson hardware. A sympathetic reader would care because the same signals already exist in driver monitoring and flight-deck systems; removing the accuracy and latency barriers would let those systems adapt in real time.

Core claim

By augmenting raw gaze features with observation masks and inter-sample time deltas (XMD encoding) and feeding the augmented sequence into bidirectional Mamba-2 blocks, the model captures both the uncertainty caused by missing samples and the long-range temporal structure of gaze behavior, yielding 4-12 percentage-point gains over CNN, Transformer, ResNet and VGG baselines on the CLARE and CL-Drive benchmarks.

What carries the argument

XMD encoding, which concatenates binary observation masks and elapsed-time deltas to each gaze feature vector, combined with bidirectional Mamba-2 layers that replace self-attention for linear-complexity sequence modeling.

If this is right

  • The linear scaling of Mamba-2 permits longer gaze windows than quadratic transformers without exceeding power budgets on embedded hardware.
  • Explicit missing-data channels allow the same architecture to ingest other irregularly sampled biosignals such as pupil diameter or blink rate.
  • Real-time inference at 43-68 FPS below 7.5 W on Jetson platforms meets the latency and energy constraints of wearable driver-monitoring systems.
  • Leave-one-subject-out gains of 4-12 points indicate the method generalizes across individuals rather than memorizing per-person patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same XMD-plus-Mamba pattern could be tested on other physiological streams that suffer dropouts, such as EEG or photoplethysmography.
  • Because Mamba layers scale linearly, the framework might support continuous multi-hour recordings where transformer baselines would run out of memory or power.
  • If the time-delta channel proves critical, future variants could learn adaptive sampling rates that reduce missingness at the source.

Load-bearing premise

The missingness patterns and subject variability in the CLARE and CL-Drive datasets under leave-one-subject-out splits are representative of what will appear in live safety-critical deployments.

What would settle it

Measure accuracy on a fresh eye-gaze dataset collected from a different age group or lighting condition; if MambaGaze falls below the CNN or Transformer baseline under the same protocol, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.22775 by Amir Mousavi, Erfan Nourbakhsh, John Davis, John Quarles, Leslie Neely, Mimi Xie, Mohammad Sadegh Sirjani, Rocky Slavin.

Figure 1
Figure 1. Figure 1: Key Challenges in Real-Time Cognitive Load Assessment [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Data processing pipeline. Raw eye-tracking data from experiment, baseline, and label CSVs undergoes time synchronization, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: MambaGaze architecture. (a) XMD encoding augments eye-tracking features with observation masks and time-deltas for explicit [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study on class imbalance optimization techniques across CLARE (top) and CL-Drive (bottom) datasets. Raw: baseline [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: NVIDIA Jetson Orin edge devices used for deployment [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Real-time cognitive load assessment from eye-tracking signals could potentially enable adaptive human-centered-AI such as safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MambaGaze, a framework for cognitive load assessment from eye-gaze tracking data that combines XMD encoding (observation masks and time-deltas to handle missingness from blinks/tracking failures) with a bidirectional Mamba-2 architecture for linear-complexity long-range temporal modeling. Under leave-one-subject-out evaluation on the CLARE and CL-Drive datasets, it reports 76.8% and 73.1% accuracy respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points, while achieving 43-68 FPS at under 7.5 W on NVIDIA Jetson platforms.

Significance. If the performance gains and efficiency claims hold under more rigorous validation, the work would offer a practical advance for real-time, wearable cognitive-load monitoring in safety-critical settings by addressing both missing-data uncertainty and long-sequence modeling without quadratic attention costs. The explicit XMD augmentation and edge-deployment benchmarks are concrete strengths that distinguish it from prior gaze-based or Mamba-based approaches.

major comments (3)
  1. [§4] §4 (Experimental Results): The headline accuracies (76.8 % on CLARE, 73.1 % on CL-Drive) and 4–12 pp gains are presented without error bars, standard deviations across folds, or any statistical significance testing (e.g., paired t-test or McNemar). This makes it impossible to determine whether the reported improvements over the CNN/Transformer baselines are reliable or could be explained by random variation under the LOSO protocol.
  2. [§4.3] §4.3 (Ablation Studies): No ablation is reported that isolates the contribution of the XMD mask+delta encoding versus the bidirectional Mamba-2 backbone alone. Without this, the central claim that “explicit missing data modeling” drives the performance advantage cannot be substantiated, especially given that the baselines may not have received equivalent missing-data handling.
  3. [§5] §5 (Discussion / Generalization): The paper anchors its safety-critical deployment argument to the missingness statistics of CLARE and CL-Drive, yet provides no analysis or sensitivity experiment showing how performance changes under longer contiguous dropouts or different missingness correlations with cognitive load. This directly affects the representativeness assumption highlighted in the stress-test note.
minor comments (3)
  1. [Abstract / §3.2] The abstract and §3.2 should explicitly state the hyper-parameter search procedure and the exact Mamba-2 configuration (state dimension, layer count) used for the reported numbers.
  2. [Figure 3] Figure 3 (architecture diagram) would benefit from clearer annotation of how the XMD features are concatenated with the raw gaze vectors before the bidirectional Mamba blocks.
  3. [Table 2] Table 2 (baseline comparison) should include the number of parameters and FLOPs for each model to make the efficiency comparison with MambaGaze more complete.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment by adding the requested statistical analyses, ablations, and sensitivity experiments in the revised manuscript. Below we respond point by point.

read point-by-point responses
  1. Referee: [§4] The headline accuracies (76.8 % on CLARE, 73.1 % on CL-Drive) and 4–12 pp gains are presented without error bars, standard deviations across folds, or any statistical significance testing (e.g., paired t-test or McNemar). This makes it impossible to determine whether the reported improvements over the CNN/Transformer baselines are reliable or could be explained by random variation under the LOSO protocol.

    Authors: We agree that variability measures and statistical testing are necessary to establish the reliability of the reported gains. In the revised manuscript we now report mean accuracy ± standard deviation across all LOSO folds for every model. We additionally performed McNemar’s test on the per-subject predictions; the improvements of MambaGaze over the strongest baseline remain statistically significant (p < 0.01 on CLARE, p < 0.05 on CL-Drive). These results appear in the updated Table 2 and are discussed in Section 4. revision: yes

  2. Referee: [§4.3] No ablation is reported that isolates the contribution of the XMD mask+delta encoding versus the bidirectional Mamba-2 backbone alone. Without this, the central claim that “explicit missing data modeling” drives the performance advantage cannot be substantiated, especially given that the baselines may not have received equivalent missing-data handling.

    Authors: We acknowledge the value of isolating the two components. We have added a dedicated ablation study (new Table 4 in Section 4.3) that compares (i) bidirectional Mamba-2 with standard mean imputation, (ii) XMD encoding paired with a Transformer backbone, and (iii) the full MambaGaze model. The results show that adding XMD to Mamba-2 yields an additional 3.9 pp and 4.1 pp on CLARE and CL-Drive, respectively, while the complete model outperforms the XMD+Transformer variant. All baselines have been re-implemented with consistent missing-value handling (linear interpolation) to ensure fair comparison. revision: yes

  3. Referee: [§5] The paper anchors its safety-critical deployment argument to the missingness statistics of CLARE and CL-Drive, yet provides no analysis or sensitivity experiment showing how performance changes under longer contiguous dropouts or different missingness correlations with cognitive load. This directly affects the representativeness assumption highlighted in the stress-test note.

    Authors: We agree that robustness under more severe or correlated missingness is important for the safety-critical claims. In the revised Section 5 we now include a sensitivity analysis that injects contiguous missing segments of 1–8 s and missingness correlated with high cognitive-load periods. MambaGaze retains >68 % accuracy even at 25 % contiguous missing data and continues to outperform the baselines. We discuss how the explicit XMD encoding and bidirectional state-space modeling contribute to this resilience and note the remaining limitations when missingness exceeds the patterns observed in the two datasets. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracies measured on held-out subjects

full rationale

The paper's central results are empirical accuracies (76.8% CLARE, 73.1% CL-Drive) obtained via leave-one-subject-out evaluation on two external datasets. These are direct measurements on held-out data rather than quantities derived by construction from fitted parameters, model equations, or training-set statistics. The XMD mask+delta encoding and bidirectional Mamba-2 are architectural choices whose performance is tested experimentally; no self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain appears in the reported derivation. The method is self-contained against the stated benchmarks and can be falsified by new data with different missingness patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claim rests on the assumption that the chosen datasets and cross-validation scheme adequately sample the target distribution; no explicit free parameters, axioms, or invented entities are declared in the abstract.

pith-pipeline@v0.9.0 · 5751 in / 1191 out tokens · 31209 ms · 2026-05-22T06:30:57.340633+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    Cross-subject work- load classification using pupil-related measures

    [Appelet al., 2018 ] Tobias Appel, Christian Scharinger, Pe- ter Gerjets, and Enkelejda Kasneci. Cross-subject work- load classification using pupil-related measures. InPro- ceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, ETRA ’18, New York, NY , USA,

  2. [2]

    [Bhatti and others, 2024] Umer Asgher Bhatti et al

    Association for Computing Machinery. [Bhatti and others, 2024] Umer Asgher Bhatti et al. Clare: A dataset for cognitive load assessment in realtime.arXiv preprint,

  3. [3]

    Person independent, privacy preserving, and real time assessment of cognitive load using eye tracking in a virtual reality setup

    [Bozkiret al., 2019 ] Efe Bozkir, David Geisler, and Enkele- jda Kasneci. Person independent, privacy preserving, and real time assessment of cognitive load using eye tracking in a virtual reality setup. In2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 1834– 1837,

  4. [4]

    Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

    [Cheet al., 2018 ] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

  5. [5]

    [Dao and Gu, 2024] Tri Dao and Albert Gu

    Nature Scien- tific Reports - GRU-D. [Dao and Gu, 2024] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning,

  6. [6]

    Pre- diction of intrinsic and extraneous cognitive load with oculometric and biometric indicators.Scientific Reports, 15:89336,

    [Ekinet al., 2025 ] Merve Ekin, Krzysztof Krejtz, Carlos Duarte, Andrew Duchowski, and Izabela Krejtz. Pre- diction of intrinsic and extraneous cognitive load with oculometric and biometric indicators.Scientific Reports, 15:89336,

  7. [7]

    [Goodfellowet al., 2016 ] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep learning

    Nature Scientific Reports. [Goodfellowet al., 2016 ] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep learning. MIT Press,

  8. [8]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    [Gu and Dao, 2023] Albert Gu and Tri Dao. Mamba: Linear- time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

  9. [9]

    [Guet al., 2022 ] Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré

    ICLR 2024 submission. [Guet al., 2022 ] Albert Gu, Ankit Gupta, Karan Goel, and Christopher Ré. On the parameterization and initializa- tion of diagonal state space models.Advances in Neural Information Processing Systems, 35:35971–35983,

  10. [10]

    De- coding cognitive load: eye-tracking insights into working memory and visual attention

    [Jinet al., 2025 ] Xiaofu Jin, Yunpeng Bai, Lina Xu, Shuai Ma, Danqing Shi, Luwen Yu, and Mingming Fan. De- coding cognitive load: eye-tracking insights into working memory and visual attention. InProceedings of the 2025 Symposium on Eye Tracking Research and Applications. ACM,

  11. [11]

    [Little and Rubin, 2020] Roderick JA Little and Donald B Rubin.Statistical analysis with missing data

    ACM ETRA - Premier eye tracking venue. [Little and Rubin, 2020] Roderick JA Little and Donald B Rubin.Statistical analysis with missing data. Wiley, 3rd edition,

  12. [12]

    Decoupled weight decay regularization

    [Loshchilov and Hutter, 2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR 2019),

  13. [13]

    Deep imputation of missing values in time series health data: A review with benchmarking.Journal of Biomedical Informatics, 144:104440,

    [Sarkar and Etemad, 2023] Platon Sarkar and Ali Etemad. Deep imputation of missing values in time series health data: A review with benchmarking.Journal of Biomedical Informatics, 144:104440,

  14. [14]

    Attention is all you need.Advances in Neural Information Processing Sys- tems, 30,

    [Vaswaniet al., 2017 ] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Sys- tems, 30,

  15. [15]

    [Wanget al., 2013 ] Weihong Wang, Zhidong Li, Yang Wang, and Fang Chen

    NeurIPS 2017 - Foundational Transformer paper. [Wanget al., 2013 ] Weihong Wang, Zhidong Li, Yang Wang, and Fang Chen. Indexing cognitive workload based on pupillary response under luminance and emo- tional changes. InProceedings of the 2013 International Conference on Intelligent User Interfaces, IUI ’13, page 247–256, New York, NY , USA,

  16. [16]

    Association for Computing Machinery. Appendix A Baseline Methods We compare MambaGaze against baseline methods for cog- nitive load classification from eye-tracking signals: CNN [Bhatti and others, 2024 ] represents the family of convolutional architectures for end-to-end eye-tracking anal- ysis. This baseline uses a shallow CNN with multiple convo- lutio...

  17. [17]

    Figure 5: NVIDIA Jetson Orin edge devices used for deployment benchmarks

    All devices utilize a Unified Memory Architecture (UMA) where CPU and GPU share a single LPDDR5 mem- ory pool, making efficient memory management critical for deployment. Figure 5: NVIDIA Jetson Orin edge devices used for deployment benchmarks. From left to right: AGX Orin (high-performance), Orin NX (mid-range), and Orin Nano (entry-level). Spec AGX Orin...