
arxiv: 2505.24848 · v6 · submitted 2025-05-30 · 💻 cs.CV · cs.LG

Reading Recognition in the Wild

Pith reviewed 2026-05-19 12:30 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords reading recognition · egocentric vision · multimodal dataset · transformer model · eye gaze · head pose · smart glasses · activity recognition

The pith

A flexible transformer recognizes reading from egocentric RGB, gaze and head pose on a new 100-hour dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine when a user is reading in everyday conditions so that always-on smart glasses can maintain a record of interactions with the world. It releases the first large multimodal dataset of 100 hours of reading and non-reading videos captured in varied real-world settings. The work demonstrates that egocentric RGB video, eye gaze and head pose serve as relevant and complementary signals when processed by a single flexible transformer model, and it explores efficient ways to encode each one. This capability would let contextual AI systems handle reading activities at realistic scale rather than only in controlled lab conditions.

Core claim

We introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task of reading recognition, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading studies to larger scale, diversity and realism.

What carries the argument

Flexible transformer model that encodes and fuses egocentric RGB, eye gaze and head pose to classify reading versus non-reading
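
To make "individually or combined" concrete, the sketch below enumerates the seven non-empty subsets of the three modalities and assembles one input sequence from whichever modalities are present. The names `MODALITIES`, `modality_subsets` and `build_sequence`, and the tagged-token layout, are illustrative assumptions rather than the paper's architecture; the per-modality encoders and the transformer itself are left out.

```python
from itertools import combinations

# Hypothetical modality names; the paper's internal identifiers are not given here.
MODALITIES = ("rgb", "gaze", "head_pose")


def modality_subsets(modalities=MODALITIES):
    """Return every non-empty combination of modalities, smallest first."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


def build_sequence(features):
    """Concatenate per-modality token lists, tagging each token with its source.

    `features` maps a modality name to a list of already-encoded tokens.
    Missing modalities are skipped, so one model can accept any subset.
    """
    sequence = []
    for name in MODALITIES:  # fixed order keeps the layout deterministic
        for token in features.get(name, []):
            sequence.append((name, token))
    return sequence


subsets = modality_subsets()
print(len(subsets))  # number of non-empty subsets of three modalities
print(build_sequence({"gaze": [0.1, 0.2], "rgb": [0.9]}))
```

Skipping absent modalities at sequence-building time is one simple way to let a single model serve every subset; whether the paper does this, or trains separate heads, is not stated in the text above.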

If this is right

  • The three modalities can be used individually or in any combination to perform reading recognition.
  • Each modality admits efficient and effective encoding within the transformer.
  • The dataset supports classification of reading types beyond the constraints of prior lab studies.
  • Contextual AI systems can maintain records of user reading interactions in everyday settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smart glasses equipped with this model could automatically offer reading assistance or generate summaries without explicit user commands.
  • The same signals might be combined with other egocentric tasks such as object detection to build richer activity timelines.
  • Performance on entirely new user groups or reading materials not present in the 100-hour collection would indicate the degree of true generalization.

Load-bearing premise

The 100 hours of reading and non-reading videos collected in diverse and realistic scenarios are representative enough to train and evaluate models that generalize to real-world reading recognition.

What would settle it

Testing the trained model on a new collection of videos recorded by different users in previously unseen environments and measuring whether accuracy remains comparable to results on the original dataset.
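
The test described above can be sketched as a participant-disjoint evaluation: hold out some users entirely, score the model on seen and unseen users separately, and look at the gap. The record fields (`participant`, `label`, `pred`) and the toy data are hypothetical and only illustrate the bookkeeping; they are not taken from the dataset.

```python
def split_by_participant(samples, held_out_ids):
    """Partition samples so no held-out participant appears in the seen set."""
    seen = [s for s in samples if s["participant"] not in held_out_ids]
    unseen = [s for s in samples if s["participant"] in held_out_ids]
    return seen, unseen


def accuracy(samples):
    """Fraction of samples whose predicted label matches the ground truth."""
    if not samples:
        raise ValueError("cannot score an empty set")
    correct = sum(1 for s in samples if s["pred"] == s["label"])
    return correct / len(samples)


# Toy records (assumed layout): label/pred are 1 for reading, 0 for not reading.
samples = [
    {"participant": "p1", "label": 1, "pred": 1},
    {"participant": "p1", "label": 0, "pred": 0},
    {"participant": "p2", "label": 1, "pred": 1},
    {"participant": "p2", "label": 0, "pred": 0},
    {"participant": "p3", "label": 1, "pred": 0},
    {"participant": "p3", "label": 0, "pred": 0},
]

seen, unseen = split_by_participant(samples, held_out_ids={"p3"})
gap = accuracy(seen) - accuracy(unseen)
print(accuracy(seen), accuracy(unseen), gap)
```

A small gap on genuinely new users and environments would support the load-bearing premise; a large one would suggest the 100-hour collection is less representative than assumed.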

Figures

Figures reproduced from arXiv: 2505.24848 by Carl Ren, Charig Yang, Hyo Jin Kim, James Fort, Kiran Somasundaram, Lambert Mathias, Luis Pesqueira, Michael J. Proulx, Mi Zhang, Omkar Parkhi, Richard Newcombe, Samiul Alam, Shakhrul Iman Siam, Sheroze Sheriffdeen, Yuning Chai.

Figure 1: Am I reading? The left figure shows a timeline as the user navigates the world. We aim to solve the task of reading recognition to enable AI assistants in always-on wearables. We identify three modalities: eye gaze (in colored dot patterns), RGB crop around gaze (in red box), and inertial sensors. Our method performs the task to high accuracy (with Prediction and GT shown). Images from our Reading in the Wild dataset,…
Figure 2: Diversity in reading materials. Reading examples across different materials, by both text type (rows) and medium (columns).
3.2 Comparison to existing datasets
The closest kin to our dataset come in two categories, as shown in …
Figure 3: Complementary modalities. Example success and failure cases for gaze and RGB, suggesting the benefit of multimodality.
Figure 5: Main results and visualizations. We show the results on Seattle (test set). (a) Our method performs the task to good accuracy, and combining all modalities yields the best results. Metrics are accuracy and F1 score at 0.5 threshold, and precision at 0.9 recall. (b) We show: (i) Col. 1, banal success cases distinguishing reading from daily activities; (ii) Col. 2-4, difficult cases where our combined model …
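
The Figure 5 caption names three metrics: accuracy and F1 at a 0.5 threshold, and precision at 0.9 recall. The sketch below shows one common way to compute them from per-sample scores. Conventions for "precision at recall" vary; here it is taken as the best precision over all thresholds whose recall is at least the target, which is an assumption and may differ from the paper's exact procedure. The function names and toy scores are illustrative.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        pred = s >= threshold
        if pred and y == 1:
            tp += 1
        elif pred and y == 0:
            fp += 1
        elif not pred and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 for the positive (reading) class at a fixed threshold."""
    tp, fp, fn, tn = confusion_counts(labels, scores, threshold)
    acc = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def precision_at_recall(labels, scores, target_recall=0.9):
    """Best precision over all thresholds whose recall is at least the target."""
    best = 0.0
    for threshold in sorted(set(scores)):
        tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if recall >= target_recall:
            best = max(best, precision)
    return best


# Toy data: 1 = reading, 0 = not reading; scores are model confidences.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
acc, f1 = accuracy_and_f1(labels, scores)
p_at_r = precision_at_recall(labels, scores)
print(acc, f1, p_at_r)
```

Reporting precision at a fixed high recall complements the thresholded metrics: it asks how many false alarms remain once the detector is tuned to miss few reading episodes.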
Figure 6: Results breakdown. We present the breakdown for the main results, including (a) precision-recall curves for different modalities, (b) a breakdown by scenario to highlight difficult cases, and (c) a breakdown by gaze span.
Single modality. We find that gaze and RGB are able to achieve reasonable performance individually, and their performances are similar to each other (82.3% and 82.2% accuracy respectively). However,…
Figure 7: Real-time detection. We evaluate our model on alternating sequences for real-time detection. In (a), we show that (i) longer gaze sequences result in higher latency, (ii) RGB has lower latency than temporal signals, and (iii) adding RGB to gaze reduces the latency compared to gaze alone. We illustrate the results in (b).
…sions in terms of the complementary role between gaze and RGB, but IMU does not help as muc…
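
The Figure 7 caption reports latency on sequences that alternate between reading and non-reading, but this page does not say how latency is measured. One plausible definition, assumed here purely for illustration, is the time from each ground-truth state change until the per-frame prediction first agrees with the new state. The function name and the toy sequences are hypothetical.

```python
def transition_latencies(gt, pred, fps):
    """Seconds from each ground-truth state change until predictions agree.

    Returns one entry per transition; None when the prediction never
    catches up before the next transition (or the end of the sequence).
    """
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    transitions = [i for i in range(1, len(gt)) if gt[i] != gt[i - 1]]
    latencies = []
    for k, start in enumerate(transitions):
        end = transitions[k + 1] if k + 1 < len(transitions) else len(gt)
        hit = next((i for i in range(start, end) if pred[i] == gt[i]), None)
        latencies.append(None if hit is None else (hit - start) / fps)
    return latencies


# Toy per-frame labels at an assumed 10 frames per second (1 = reading).
gt = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(transition_latencies(gt, pred, fps=10))
```

Under this definition a model that needs a longer window of gaze samples before committing would show larger values, which is consistent with the caption's observation that longer gaze sequences raise latency.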
Figure 8: Noise robustness. Augmentation (red) lowers degradation.

GT \ Pred       1     2     3     4     5     6     7
1 No read     0.88  0.04  0.02  0.02  0.01  0.03  0.00
2 Walk        0.09  0.85  0.04  0.01  0.00  0.00  0.01
3 Out loud    0.13  0.02  0.64  0.17  0.02  0.01  0.01
4 Engaged     0.14  0.02  0.06  0.54  0.12  0.01  0.11
5 Scan        0.08  0.01  0.03  0.39  0.41  0.00  0.08
6 Write/type  0.49  0.01  0.03  0.02  0.05  0.39  0.01
7 Skim        0.13  0.04  0.05  0.47  0.15  0.00  0.16
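
The matrix printed with Figure 8 has ground-truth classes as rows and predictions as columns, and each row sums to 1, so the diagonal can be read as per-class recall. The sketch below copies those values and extracts the recall and the most frequent wrong prediction for each class. The helper names are illustrative.

```python
CLASSES = ["No read", "Walk", "Out loud", "Engaged", "Scan", "Write/type", "Skim"]

# Rows are ground truth, columns are predictions, copied from the matrix above.
CONFUSION = [
    [0.88, 0.04, 0.02, 0.02, 0.01, 0.03, 0.00],
    [0.09, 0.85, 0.04, 0.01, 0.00, 0.00, 0.01],
    [0.13, 0.02, 0.64, 0.17, 0.02, 0.01, 0.01],
    [0.14, 0.02, 0.06, 0.54, 0.12, 0.01, 0.11],
    [0.08, 0.01, 0.03, 0.39, 0.41, 0.00, 0.08],
    [0.49, 0.01, 0.03, 0.02, 0.05, 0.39, 0.01],
    [0.13, 0.04, 0.05, 0.47, 0.15, 0.00, 0.16],
]


def per_class_recall(matrix, classes):
    """Diagonal of a row-normalized confusion matrix, keyed by class name."""
    return {name: matrix[i][i] for i, name in enumerate(classes)}


def top_confusion(matrix, classes):
    """For each true class, the wrong class it is most often predicted as."""
    result = {}
    for i, name in enumerate(classes):
        wrong = [(matrix[i][j], classes[j]) for j in range(len(classes)) if j != i]
        share, predicted = max(wrong)
        result[name] = (predicted, share)
    return result


recall = per_class_recall(CONFUSION, CLASSES)
confusions = top_confusion(CONFUSION, CLASSES)
for name in CLASSES:
    print(name, recall[name], confusions[name])
```

Read this way, the values show Skim recovered only 16% of the time and labelled Engaged 47% of the time, and Write/type labelled No read 49% of the time, while No read and Walk are recognized far more reliably.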
Figure 9: Distribution of reading materials in Seattle subset. In (a), we show the distribution of reading mediums. Within each medium (Print, Digital and Object), we then break down the reading materials in (b), (c), and (e). Additionally, for Digital media, we break down by the device involved in the recording.
Figure 10: Distribution of reading modes in Seattle subset together with illustrations. Almost half of our reading samples are engaged reading, while other scenarios diversely reflect how people read in different scenarios.
B.3 Negative data
We collected two types of negative data: (1) everyday activities that do not involve reading and (2) hard negatives where text is visible but not being read by the participant. E…
Figure 11: Distribution of negative data in Seattle subset. In addition to daily activities, our dataset also includes hard negatives, where the user has text in the scene but is not reading, making it indistinguishable using the RGB stream alone.
B.4 Alternating sequences
We also collected test sequences that alternate between reading and non-reading activities, allowing for the evaluation of temporal localization an…
Figure 12: Demographic statistics of the Seattle subset. (i) shows the age group, (ii) shows the gender distribution.
Figure 13: Example of metadata for Seattle subset. The metadata contains several useful pieces of information to facilitate further research, including both multiple-choice questions and short answers.
B.7 Protocols
A successful data collection protocol ensures efficient collection while guaranteeing quality. Reading is a complex process that includes word recognition, which encompasses visual processing and language deco…
Figure 14: Types of reading materials covered. The top row shows the distributions of (a) medium, (b) text types, and (c) non-text content, in minutes. The bottom row shows the same distributions in terms of number of segments.
‘Medium’ indicates what kind of device or object the subject is reading from. The following is a list of reading materials included for different mediums. Similarly to the Seattle subset, we indica…
Figure 15: Platform distributions by medium. The top row shows the total duration in minutes for digital, print, and object platforms, while the bottom row displays the corresponding number of recordings.
C.2 Type of reading modes covered
In the Columbus subset, we have instances of both the subject reading (positive cases) and not reading (negative cases). However, similar to the Seattle subset, the mode of reading c…
Figure 16: Mirror Setup: Print Medium. Here the subject is asked to read a comic vs. when asked not to read anything but instead look at pictures only.
Panel labels: Reading, Not reading
Figure 17: Mirror Setup: Walking in Corridor. Here the subject is asked to read the room numbers and signs in a corridor vs. when asked to traverse through normally.
Panel labels: Reading (Engaged), Reading (Scanning)
Figure 18: Mirror Setup: Reading vs Searching. Here the subject is asked to read the serial numbers on a circuit board vs. when asked to specifically search for and count the number of resistors.
C.5 Annotation
To annotate the raw data and extract segments, we utilized a labeling tool developed in PyQt5…
Figure 19: Graphical interface of annotation tool. The user interface consists of the RGB video preview with sound and the gaze point overlaid on top, shown in green. The interface allows adjusting the start and end time of each recording as well as annotating metadata like content-type, content-length and medium.
C.6 Demographics
We posted flyers in various locations to inform interested people about the study. Part…
Figure 20: Demographic statistics of the Columbus subset. An overview of the demographic distribution is shown: (i) Age Range, (ii) Gender, (iii) Native vs Non-native Speakers, (iv) Visual Aid Requirements, and (v) Education.
Pre-session Preparation
The process began with participants filling out a demographic questionnaire in a controlled indoor environment. Following this, an attendant explained how to calibrate t…
Figure 21: Example metadata for Columbus subset. The metadata contained is slightly different from that of the Seattle subset, allowing for different directions to explore for further research.
C.8.3 Post-processing
After the sessions, we generate draft previews of each session for segmentation. The previews are low quality RGB videos of the session with the gaze overlaid on the videos. We get the approximate segmen…
Figure 22: PR curves for different modalities. We compare the PR curves of different modalities. (a) shows the curves for individual modalities and (b) compares how combining different modalities influences performance.
Figure 23: Breakdown by content length and content type in Columbus subset. The figures show the Accuracy and F1 scores respectively of different combinations of content length (Paragraph vs Short text) and content type (Text only vs Includes non-text), across different modalities. We show that reading detection works better on paragraphs and text-only cases.
D.2 Result breakdown
Content type and content length
Figure 24: Breakdown by medium and content type in Columbus subset. The figures show the Accuracy and F1 scores respectively of different combinations of medium (Print vs Digital vs Object) and content type (Text only vs Includes non-text), across different modalities. We show that reading detection works better on print and digital media.
Medium and content type
Figure 25: RGB fails but Gaze succeeds. Failure case across 6 frames where RGB fails but Gaze works. Notice that the RGB crop indicated in red has partial coverage of the reading material.
Figure 26: Gaze fails but RGB succeeds. Here, the subject is reading a map where text is irregularly placed across the field of view, making gaze patterns more sporadic. In contrast, …
Figure 27: Individual modality fails, combined modality succeeds. Here, both gaze and RGB fail individually but succeed when combined.
Figure 28: Misleading RGB: Gaze measurement error. Failure case across 6 frames where Gaze succeeds but RGB Only and Gaze with RGB fail. Note that here the eye gaze is offset due to measurement error, putting the RGB crop outside the reading material.
Figure 29: Misleading RGB: Partial coverage. Failure case across 6 frames where Gaze succeeds but RGB Only and Gaze with RGB fail. Notice that the RGB crop indicated in red has partial coverage of the reading material.
Figure 30: All Modality Failure case: Searching on Map. Failure case across 6 frames where all modalities and their combinations fail. Here the participant is searching for a particular name on a map.
Figure 31: All Modality Failure case: Reading room numbers while walking. Failure case across 6 frames where all modalities and their combinations fail. Note that here the subject is asked to read room numbers while walking. The gaze pattern scans across the corridor before reading the room number. Note that the model was not trained on this kind of scenario.
Figure 33: Not Reading
Original abstract

To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the task of reading recognition in egocentric videos for always-on smart glasses, presents the Reading in the Wild dataset comprising 100 hours of multimodal reading and non-reading videos collected in diverse realistic scenarios, identifies three modalities (egocentric RGB, eye gaze, and head pose), proposes a flexible transformer model that processes these modalities individually or in combination, claims to demonstrate that the modalities are relevant and complementary, investigates efficient encoding strategies for each, and shows the dataset's utility for classifying types of reading at larger scale and realism than prior constrained studies.

Significance. If the unreported experiments confirm modality complementarity and dataset utility with appropriate ablations and generalization tests, the work could provide a valuable large-scale benchmark for multimodal egocentric vision and advance contextual AI applications in wearable devices by extending reading understanding beyond lab settings.

major comments (2)
  1. [Abstract] The claim that 'these modalities are relevant and complementary to the task' is load-bearing for the central contribution yet is asserted without any quantitative results, ablation studies, fusion performance deltas, error analysis, or evaluation metrics, making it impossible to verify whether the data supports the assertion.
  2. [Abstract] The load-bearing assumption that the 100 hours of videos 'in diverse and realistic scenarios' are representative enough to train and evaluate models that generalize is stated without details on collection protocol, annotation process, diversity quantification, or train/test splits, preventing assessment of potential biases or leakage.
minor comments (1)
  1. [Abstract] The description of the transformer as 'flexible' and the investigation of 'how to efficiently and effectively encode each modality' would benefit from at least high-level architectural or encoding details even in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address each of the major comments point by point below.

point-by-point responses
  1. Referee: [Abstract] The claim that 'these modalities are relevant and complementary to the task' is load-bearing for the central contribution yet is asserted without any quantitative results, ablation studies, fusion performance deltas, error analysis, or evaluation metrics, making it impossible to verify whether the data supports the assertion.

    Authors: We agree that the abstract, being a concise summary, does not include specific quantitative evidence. The full manuscript contains sections with ablation studies, performance comparisons for individual and combined modalities, and metrics that support the relevance and complementarity of egocentric RGB, eye gaze, and head pose. To improve clarity, we will revise the abstract to include a brief statement referencing these experimental findings. revision: yes

  2. Referee: [Abstract] The load-bearing assumption that the 100 hours of videos 'in diverse and realistic scenarios' are representative enough to train and evaluate models that generalize is stated without details on collection protocol, annotation process, diversity quantification, or train/test splits, preventing assessment of potential biases or leakage.

    Authors: The abstract summarizes the key aspects of the dataset. Detailed information on the collection protocol, annotation process, measures of diversity, and the train/test splits (designed to prevent data leakage) is provided in the main body of the paper. We acknowledge that incorporating a short reference to these elements in the abstract would help address concerns about generalization and potential biases. We will make this revision. revision: yes

Circularity Check

0 steps flagged

No circularity: new dataset and empirical claims with no derivation chain present

full rationale

Only the abstract is available, which introduces a new task, a 100-hour multimodal dataset, and a flexible transformer using RGB, gaze, and pose modalities. No equations, fitted parameters, self-citations, or derivations appear that could reduce a claimed result to its own inputs by construction. The contribution is framed as data collection plus empirical model application, which is self-contained against external benchmarks and contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only information provides no identifiable free parameters, axioms, or invented entities; the work appears to rely on standard multimodal learning techniques applied to a new dataset.

pith-pipeline@v0.9.0 · 5713 in / 1039 out tokens · 52012 ms · 2026-05-19T12:30:19.929143+00:00 · methodology



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Towards predicting reading comprehension from gaze behavior

    Seoyoung Ahn, Conor Kelton, Aruna Balasubramanian, and Greg Zelinsky. Towards predicting reading comprehension from gaze behavior. InETRA, 2020

  2. [2]

    Whisperx: Time-accurate speech transcrip- tion of long-form audio.INTERSPEECH, 2023

    Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. Whisperx: Time-accurate speech transcrip- tion of long-form audio.INTERSPEECH, 2023

  3. [3]

    A robust realtime reading-skimming classifier

    Ralf Biedert, Jörn Hees, Andreas Dengel, and Georg Buscher. A robust realtime reading-skimming classifier. InETRA, 2012

  4. [4]

    Robust recognition of reading activ- ity in transit using wearable electrooculography

    Andreas Bulling, Jamie A Ward, Hans Gellersen, and Gerhard Tröster. Robust recognition of reading activ- ity in transit using wearable electrooculography. InPervasive Computing: 6th International Conference, Pervasive 2008 Sydney, Australia, May 19-22, 2008 Proceedings 6, pages 19–37. Springer, 2008

  5. [5]

    Visual attentional training improves reading capabilities in children with dyslexia: An eye tracker study during a reading task.Brain sciences, 10(8), 2020

    Simona Caldani, Christophe-Loïc Gerard, Hugo Peyre, and Maria Pia Bucci. Visual attentional training improves reading capabilities in children with dyslexia: An eye tracker study during a reading task.Brain sciences, 10(8), 2020

  6. [6]

    A robust algorithm for reading detection

    Christopher S Campbell and Paul P Maglio. A robust algorithm for reading detection. InPUI, 2001

  7. [7]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  8. [8]

    Improving the understanding of web user behaviors through machine learning analysis of eye-tracking data.User Modeling and User-Adapted Interaction, 34(2), 2024

    Diana Castilla, Omar Del Tejo Catalá, Patricia Pons, François Signol, Beatriz Rey, Carlos Suso-Ribera, and Juan-Carlos Perez-Cortes. Improving the understanding of web user behaviors through machine learning analysis of eye-tracking data.User Modeling and User-Adapted Interaction, 34(2), 2024

  9. [9]

    Gazexplain: Learning to predict natural language explanations of visual scanpaths

    Xianyu Chen, Ming Jiang, and Qi Zhao. Gazexplain: Learning to predict natural language explanations of visual scanpaths. InECCV, 2024

  10. [10]

    Predicting reading comprehension scores from eye movements using artificial neural networks and fuzzy output error.Artif

    Leana Copeland, Tom Gedeon, and B Sumudu U Mendis. Predicting reading comprehension scores from eye movements using artificial neural networks and fuzzy output error.Artif. Intell. Res., 3(3), 2014

  11. [11]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InECCV, 2018

  12. [12]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasad So...

  13. [13]

    Oat: Object-level attention transformer for gaze scanpath prediction

    Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, and Bertram Shi. Oat: Object-level attention transformer for gaze scanpath prediction. InECCV, 2024

  14. [14]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 11

  15. [15]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Gird- har, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

  16. [16]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InCVPR, 2024

  17. [17]

    Zuco, a simultaneous eeg and eye-tracking resource for natural sentence reading.Scientific data, 5(1), 2018

    Nora Hollenstein, Jonathan Rotsztejn, Marius Troendle, Andreas Pedroni, Ce Zhang, and Nicolas Langer. Zuco, a simultaneous eeg and eye-tracking resource for natural sentence reading.Scientific data, 5(1), 2018

  18. [18]

    Involution fused convnet for classifying eye- tracking patterns of children with autism spectrum disorder.Engineering Applications of Artificial Intelligence, 2025

    Md Farhadul Islam, Meem Arafat Manab, Joyanta Jyoti Mondal, Sarah Zabeen, Fardin Bin Rahman, Md Zahidul Hasan, Farig Sadeque, and Jannatun Noor. Involution fused convnet for classifying eye- tracking patterns of children with autism spectrum disorder.Engineering Applications of Artificial Intelligence, 2025

  19. [19]

    Icdar 2024 competition on reading documents through aria glasses

    Soumya Shamarao Jahagirdar, Ajoy Mondal, Yuheng Ren, Omkar M Parkhi, and CV Jawahar. Icdar 2024 competition on reading documents through aria glasses. InICDAR, 2024

  20. [20]

    Boosting gaze object prediction via pixel-level supervision from vision foundation model

    Yang Jin, Lei Zhang, Shi Yan, Bin Fan, and Binglu Wang. Boosting gaze object prediction via pixel-level supervision from vision foundation model. InECCV, 2024

  21. [21]

    A theory of reading: from eye fixations to comprehension

    Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehension. Psychological review, 87(4), 1980

  22. [22]

    Epic-fusion: Audio-visual temporal binding for egocentric action recognition

    Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. InICCV, 2019

  23. [23]

    Reading detection in real-time

    Conor Kelton, Zijun Wei, Seoyoung Ahn, Aruna Balasubramanian, Samir R Das, Dimitris Samaras, and Gregory Zelinsky. Reading detection in real-time. InETRA, 2019

  24. [24]

    Enhancing human-computer interaction in chest x-ray analysis using vision and language model with eye gaze patterns

    Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Yue Gao, and Honghan Wu. Enhancing human-computer interaction in chest x-ray analysis using vision and language model with eye gaze patterns. InMICCAI, 2024

  25. [25]

    Gaze-detr: Using expert gaze to reduce false positives in vulvovaginal candidiasis screening

    Yan Kong, Sheng Wang, Jiangdong Cai, Zihao Zhao, Zhenrong Shen, Yonghao Li, Manman Fei, and Qian Wang. Gaze-detr: Using expert gaze to reduce false positives in vulvovaginal candidiasis screening. In MICCAI, 2024

  26. [26]

    Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear.arXiv preprint arXiv:2401.17217, 2024

    Robert Konrad, Nitish Padmanaban, J Gabriel Buckmaster, Kevin C Boyle, and Gordon Wetzstein. Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear.arXiv preprint arXiv:2401.17217, 2024

  27. [27]

    I know what you are reading: recognition of document types using mobile eye tracking

    Kai Kunze, Yuzuko Utsumi, Yuki Shiga, Koichi Kise, and Andreas Bulling. I know what you are reading: recognition of document types using mobile eye tracking. InISWC, 2013

  28. [28]

    Classification of reading and not reading behavior based on eye movement analysis

    Manuel Landsmann, Olivier Augereau, and Koichi Kise. Classification of reading and not reading behavior based on eye movement analysis. InISWC, 2019

  29. [29]

    In the eye of beholder: Joint learning of gaze and actions in first person video

    Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. InECCV, 2018

  30. [30]

    Classification of reading patterns based on gaze information

    Wen-Hung Liao, Chin-Wen Chang, and Yi-Chieh Wu. Classification of reading patterns based on gaze information. In2017 IEEE International Symposium on Multimedia (ISM). IEEE, 2017. 12

  31. [31]

    Gazehta: End-to-end gaze target detection with head-target association.arXiv preprint arXiv:2404.10718, 2024

    Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, and Xucong Zhang. Gazehta: End-to-end gaze target detection with head-target association.arXiv preprint arXiv:2404.10718, 2024

  32. [32]

    Gem: Context-aware gaze estimation with visual search behavior matching for chest radiograph

    Shaonan Liu, Wenting Chen, Jie Liu, Xiaoling Luo, and Linlin Shen. Gem: Context-aware gaze estimation with visual search behavior matching for chest radiograph. In MICCAI, 2024

  33. [33]

    Using eye-tracking measures to predict reading comprehension. Reading Research Quarterly, 58(3), 2023

    Diane C Mézière, Lili Yu, Erik D Reichle, Titus von der Malsburg, and Genevieve McArthur. Using eye-tracking measures to predict reading comprehension. Reading Research Quarterly, 58(3), 2023

  34. [34]

    Integrating human gaze into attention for egocentric activity recognition

    Kyle Min and Jason J Corso. Integrating human gaze into attention for egocentric activity recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1069–1078, 2021

  35. [35]

    Look hear: Gaze prediction for speech-directed human attention

    Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai. Look hear: Gaze prediction for speech-directed human attention. In ECCV, 2024

  36. [36]

    Thumb’s rule tested: Visual angle of thumb’s width is about 2 deg. Perception, 20, 1991

    Robert O’Shea. Thumb’s rule tested: Visual angle of thumb’s width is about 2 deg. Perception, 20, 1991

  37. [37]

    A transformer-based model for the prediction of human gaze behavior on videos

    Süleyman Özdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, and Enkelejda Kasneci. A transformer-based model for the prediction of human gaze behavior on videos. In ETRA, 2024

  38. [38]

    Egoblur: Responsible innovation in aria, 2023

    Nikhil Raina, Guruprasad Somasundaram, Kang Zheng, Sagar Miglani, Steve Saarinen, Jeff Meissner, Mark Schwesinger, Luis Pesqueira, Ishita Prasad, Edward Miller, Prince Gupta, Mingfei Yan, Richard Newcombe, Carl Ren, and Omkar M Parkhi. Egoblur: Responsible innovation in aria, 2023

  39. [39]

    On understanding creative language: the late positive complex and novel metaphor comprehension. Brain research, 1678, 2018

    Karolina Rataj, Anna Przekoracka-Krawczyk, and Rob HJ Van der Lubbe. On understanding creative language: the late positive complex and novel metaphor comprehension. Brain research, 1678, 2018

  40. [40]

    Learning user embeddings from human gaze for personalised saliency prediction

    Florian Strohm, Mihai Bâce, and Andreas Bulling. Learning user embeddings from human gaze for personalised saliency prediction. In ETRA, 2024

  41. [41]

    Sara: Smart ai reading assistant for reading comprehension

    Enkeleda Thaqi, Mohamed Omar Mantawy, and Enkelejda Kasneci. Sara: Smart ai reading assistant for reading comprehension. In ETRA, 2024

  42. [42]

    Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024

    Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, and Jingya Wang. Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024

  43. [43]

    Gazeprompt: Enhancing low vision people’s reading experience with gaze-aware augmentations

    Ru Wang, Zach Potter, Yun Ho, Daniel Killough, Linxiu Zeng, Sanbrita Mondal, and Yuhang Zhao. Gazeprompt: Enhancing low vision people’s reading experience with gaze-aware augmentations. In CHI Conference on Human Factors in Computing Systems, 2024

  44. [44]

    Gaze-directed vision gnn for mitigating shortcut learning in medical image

    Shaoxuan Wu, Xiao Zhang, Bin Wang, Zhuo Jin, Hansheng Li, and Jun Feng. Gaze-directed vision gnn for mitigating shortcut learning in medical image. In MICCAI, 2024

  45. [45]

    Fast and accurate text classification: Skimming, rereading and early stopping

    Keyi Yu, Yang Liu, Alexander G Schwing, and Jian Peng. Fast and accurate text classification: Skimming, rereading and early stopping. In ICLR, 2018

  46. [46]

    Interead: An eye tracking dataset of interrupted reading

    Francesca Zermiani, Prajit Dhar, Ekta Sood, Fabian Kögel, Andreas Bulling, and Maria Wirzberger. Interead: An eye tracking dataset of interrupted reading. In LREC-COLING, 2024

  47. [47]

    Can gaze inform egocentric action recognition? In ETRA, 2022

    Zehua Zhang, David Crandall, Michael Proulx, Sachin Talathi, and Abhishek Sharma. Can gaze inform egocentric action recognition? In ETRA, 2022

  48. [48]

    Learning video representations from large language models

    Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In CVPR, 2023

    Reading Recognition in the Wild: Supplementary Material. A Introduction. Additional dataset details. Our dataset is the first instance of a reading activity recognition dataset in unconstrained environments and is also the ...

    Table 13: Comparison of alternative methods. This table compares approaches for reading recognition, including (i) vision-language models (VLMs), (ii) action recognition models, and (iii) altern...

    Enabled modalities
    Gaze: ✗ | ✗ | ✓ | ✓
    RGB: ✓ | ✓ | ✗ | ✓
    IMU: ✗ | ✗ | ✗ | ✓
    Fusion: ✓ | ✗ | ✗ | ✓

    On-device Feasibility: ✗ | ✗ | ✓ | ✓
    Number of parameters: 11B | 25M | 1k | 130k
    Sensing cost (power): high | high | low | low
    RGB requirements: full RGB | full RGB video | - | foveated patch (5° FoV); (dominates sensing cost) (optional)
    Real-time: ✗ | ✗ | ✓ | ✓
    Inference time (ms): 567.410 | 895.511 (incl. flow) | 0.310 | 0.545

    Performance
    Zero-shot capability: ✓ | ✗ | ✗ | ✓
    Acc / F1 on RiTW Columbus: 76.7 / 65.6 | - | - | 82.9 / 88.8
    Acc / F1 on EGTEA dataset: 89.6 / 61.5 | 88.8 / 65.8 | 85.8 / 62.8 | 89.6 / 70.6