Reading Recognition in the Wild

Carl Ren; Charig Yang; Hyo Jin Kim; James Fort; Kiran Somasundaram; Lambert Mathias; Luis Pesqueira; Michael J. Proulx; Mi Zhang; Omkar Parkhi

arxiv: 2505.24848 · v6 · submitted 2025-05-30 · 💻 cs.CV · cs.LG

Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

Pith reviewed 2026-05-19 12:30 UTC · model grok-4.3

keywords: reading recognition, egocentric vision, multimodal dataset, transformer model, eye gaze, head pose, smart glasses, activity recognition

The pith

A flexible transformer recognizes reading from egocentric RGB, gaze and head pose on a new 100-hour dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine when a user is reading in everyday conditions so that always-on smart glasses can maintain a record of interactions with the world. It releases the first large multimodal dataset of 100 hours of reading and non-reading videos captured in varied real-world settings. The work demonstrates that egocentric RGB video, eye gaze and head pose serve as relevant and complementary signals when processed by a single flexible transformer model, and it explores efficient ways to encode each one. This capability would let contextual AI systems handle reading activities at realistic scale rather than only in controlled lab conditions.

Core claim

We introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task of reading recognition, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading studies conducted in constrained settings to larger scale, diversity and realism.

What carries the argument

Flexible transformer model that encodes and fuses egocentric RGB, eye gaze and head pose to classify reading versus non-reading

If this is right

The three modalities can be used individually or in any combination to perform reading recognition.
Each modality admits efficient and effective encoding within the transformer.
The dataset supports classification of reading types beyond the constraints of prior lab studies.
Contextual AI systems can maintain records of user reading interactions in everyday settings.
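
The first point, that the modalities work individually or in any combination, is often obtained by randomly hiding whole modalities during training. The paper passage quoted later on this page mentions modality dropout but gives no details here. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the names `modality_dropout` and `fuse`, the drop probability, the toy 2-dimensional embeddings, and the mean-pooling fusion (a simple stand-in for the paper's transformer) are all hypothetical.

```python
import random
from typing import Dict, List

Embedding = List[float]


def modality_dropout(
    features: Dict[str, Embedding], p_drop: float, rng: random.Random
) -> Dict[str, Embedding]:
    """Randomly drop whole modalities, always keeping at least one.

    Hypothetical training-time helper: each modality is dropped
    independently with probability p_drop.
    """
    kept = {name: vec for name, vec in features.items() if rng.random() >= p_drop}
    if not kept:
        # Never feed an empty input: fall back to one random modality.
        name = rng.choice(sorted(features))
        kept = {name: features[name]}
    return kept


def fuse(features: Dict[str, Embedding]) -> Embedding:
    """Average the embeddings of whichever modalities are present.

    Assumption: mean pooling replaces the transformer fusion here so that
    the example stays dependency-free.
    """
    if not features:
        raise ValueError("at least one modality is required")
    vectors = list(features.values())
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings; the modality names follow the paper, the values are made up.
example = {"rgb": [1.0, 0.0], "gaze": [0.0, 1.0], "head_pose": [1.0, 1.0]}
rng = random.Random(0)
training_input = fuse(modality_dropout(example, p_drop=0.5, rng=rng))
gaze_only = fuse({"gaze": example["gaze"]})
```

Because `fuse` accepts any non-empty subset, the same code path serves single-modality and combined inference, which is the property the list above describes.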

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smart glasses equipped with this model could automatically offer reading assistance or generate summaries without explicit user commands.
The same signals might be combined with other egocentric tasks such as object detection to build richer activity timelines.
Performance on entirely new user groups or reading materials not present in the 100-hour collection would indicate the degree of true generalization.

Load-bearing premise

The 100 hours of reading and non-reading videos collected in diverse and realistic scenarios are representative enough to train and evaluate models that generalize to real-world reading recognition.

What would settle it

Testing the trained model on a new collection of videos recorded by different users in previously unseen environments and measuring whether accuracy remains comparable to results on the original dataset.

Figures

Figures reproduced from arXiv: 2505.24848 by Carl Ren, Charig Yang, Hyo Jin Kim, James Fort, Kiran Somasundaram, Lambert Mathias, Luis Pesqueira, Michael J. Proulx, Mi Zhang, Omkar Parkhi, Richard Newcombe, Samiul Alam, Shakhrul Iman Siam, Sheroze Sheriffdeen, Yuning Chai.

**Figure 2.** Diversity in reading materials. Reading examples across different materials, both text type (rows) and medium (column).

**Figure 3.** Complementary modalities. Example success and failure cases for gaze and RGB, suggesting the benefit of multimodality.

**Figure 5.** Main results and visualizations. We show the results on Seattle (test set). (a) Our method performs the task to good accuracy, and combining all modalities yields the best results. Metrics are accuracy and F1 score at 0.5 threshold, and precision at 0.9 recall. (b) We show: (i) Col. 1, banal success cases distinguishing reading from daily activities; (ii) Col. 2-4, difficult cases where our combined model …
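
The three metrics named in the Figure 5 caption can be computed from per-clip scores and binary labels. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the scores and labels are made-up toy values, and "precision at 0.9 recall" is taken to mean the best precision over all thresholds whose recall is at least 0.9 (other conventions exist).

```python
from typing import List, Tuple


def accuracy_and_f1(
    scores: List[float], labels: List[int], threshold: float = 0.5
) -> Tuple[float, float]:
    """Accuracy and F1 after thresholding scores (label 1 = reading)."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def precision_at_recall(
    scores: List[float], labels: List[int], target_recall: float = 0.9
) -> float:
    """Best precision among thresholds that reach the target recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best, tp = 0.0, 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Tied scores share a threshold, so evaluate only after the last tie.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if tp / positives >= target_recall:
            best = max(best, tp / k)
    return best


# Toy data, not from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_scores, toy_labels)
toy_p_at_r = precision_at_recall(toy_scores, toy_labels)
```

Fixing the recall at 0.9 and reporting precision shows how many false alarms a detector raises when it is required to catch most reading events, which a single 0.5 threshold does not reveal.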

**Figure 6.** Results breakdown. We present the breakdown for the main results, including (a) precision-recall curve for different modalities (b) breakdown by scenario to highlight difficult cases (c) breakdown by gaze span.

Single modality. We find that gaze and RGB are able to achieve reasonable performance individually, and their performances are similar to each other (82.3% and 82.2% accuracy respectively). However,…

**Figure 7.** Real-time detection. We evaluate our model on alternating sequences for real-time detection. In (a), we show that (i) longer gaze sequences result in higher latency, (ii) RGB has lower latency than temporal signals (iii) adding RGB to gaze reduces the latency compared to gaze alone. We illustrate the results in (b).

**Figure 8.** Noise robustness. Augmentation (red) lowers degradation.

Confusion matrix (rows: ground truth; columns: predicted class, numbered as in the rows):

| GT \ Pred | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 No read | 0.88 | 0.04 | 0.02 | 0.02 | 0.01 | 0.03 | 0.00 |
| 2 Walk | 0.09 | 0.85 | 0.04 | 0.01 | 0.00 | 0.00 | 0.01 |
| 3 Out loud | 0.13 | 0.02 | 0.64 | 0.17 | 0.02 | 0.01 | 0.01 |
| 4 Engaged | 0.14 | 0.02 | 0.06 | 0.54 | 0.12 | 0.01 | 0.11 |
| 5 Scan | 0.08 | 0.01 | 0.03 | 0.39 | 0.41 | 0.00 | 0.08 |
| 6 Write/type | 0.49 | 0.01 | 0.03 | 0.02 | 0.05 | 0.39 | 0.01 |
| 7 Skim | 0.13 | 0.04 | 0.05 | 0.47 | 0.15 | 0.00 | 0.16 |
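
Each row of the matrix printed with the Figure 8 caption sums to 1.00, so the diagonal can be read as per-class recall. The sketch below finds the most common wrong prediction for each class and flags classes whose largest confusion exceeds their own recall. The numbers are copied from the caption; the helper names `top_confusion` and `dominated_classes` are hypothetical and nothing else is taken from the paper.

```python
from typing import List, Tuple

CLASSES = ["No read", "Walk", "Out loud", "Engaged", "Scan", "Write/type", "Skim"]

# Rows are ground truth, columns are predictions, in the order of CLASSES.
CONFUSION = [
    [0.88, 0.04, 0.02, 0.02, 0.01, 0.03, 0.00],
    [0.09, 0.85, 0.04, 0.01, 0.00, 0.00, 0.01],
    [0.13, 0.02, 0.64, 0.17, 0.02, 0.01, 0.01],
    [0.14, 0.02, 0.06, 0.54, 0.12, 0.01, 0.11],
    [0.08, 0.01, 0.03, 0.39, 0.41, 0.00, 0.08],
    [0.49, 0.01, 0.03, 0.02, 0.05, 0.39, 0.01],
    [0.13, 0.04, 0.05, 0.47, 0.15, 0.00, 0.16],
]


def top_confusion(row_index: int) -> Tuple[str, float]:
    """Return the most frequent wrong prediction for one ground-truth class."""
    row = CONFUSION[row_index]
    off_diagonal = [(CLASSES[j], v) for j, v in enumerate(row) if j != row_index]
    return max(off_diagonal, key=lambda pair: pair[1])


def dominated_classes() -> List[str]:
    """Classes predicted as some other class more often than as themselves."""
    return [
        CLASSES[i]
        for i in range(len(CLASSES))
        if top_confusion(i)[1] > CONFUSION[i][i]
    ]


for i, name in enumerate(CLASSES):
    wrong, rate = top_confusion(i)
    print(f"{name}: recall {CONFUSION[i][i]:.2f}, most often predicted as {wrong} ({rate:.2f})")
print("Dominated by another class:", dominated_classes())
```

On these numbers, Skim is predicted as Engaged more often than as Skim (0.47 against 0.16), and Write/type is predicted as No read more often than as itself (0.49 against 0.39).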

**Figure 9.** Distribution of reading materials in Seattle subset. In (a), we show the distribution of reading mediums. Within each medium (Print, Digital and Object), we then break down the reading materials in (b), (c), and (e). Additionally, for Digital media, we break down by the device involved in the recording.

**Figure 10.** Distribution of reading modes in Seattle subset together with illustrations. Almost half of our reading samples is engaged reading, while other scenarios diversely reflect how people read in different scenarios.

B.3 Negative data. We collected two types of negative data: (1) everyday activities that do not involve reading and (2) hard negatives where text is visible but not being read by the participant. …

**Figure 11.** Distribution of negative data in Seattle subset. In addition to daily activities, our dataset also includes hard negatives, where the user has a text in scene but is not reading, making it indistinguishable using the RGB stream alone.

B.4 Alternating sequences. We also collected test sequences that alternate between reading and non-reading activities, allowing for the evaluation of temporal localization …

**Figure 12.** Demographic statistics of the Seattle subset. (i) shows the age group, (ii) shows gender distribution.

**Figure 13.** Example of metadata for Seattle subset. The metadata contains several useful information to facilitate further research, including both multiple-choice questions and short answers.

B.7 Protocols. A successful data collection protocol ensures efficient collection while guaranteeing the quality. Reading is a complex process that includes word recognition, which encompasses visual processing and language deco…

**Figure 14.** Types of reading materials covered. The top row shows the distributions of (a) medium (b) text types (c) non-text content, in minutes. The bottom row shows the same distributions in terms of number of segments. ‘Medium’ indicates what kind of device or object the subject is reading from.

The following is a list of reading materials included for different mediums. Similarly to the Seattle subset, we indica…

**Figure 15.** Platform distributions by medium. The top row shows the total duration in minutes for digital, print, and object platforms, while the bottom row displays the corresponding number of recordings.

C.2 Type of reading modes covered. In the Columbus subset, we have both instances of subject reading or positive cases and not reading or negative cases. However, similar to the Seattle subset, the mode of reading c…

**Figure 16.** Mirror Setup: Print Medium. Here subject is asked to read a comic vs. when asked to not read anything but instead look at pictures only. Panel labels: Reading; Not reading.

**Figure 17.** Mirror Setup: Walking in Corridor. Here the subject is asked to read the room numbers and signs in a corridor vs when asked to traverse through normally. Panel labels: Reading (Engaged); Reading (Scanning).

**Figure 18.** Mirror Setup: Reading vs Searching. Here the subject is asked to read the serial numbers in a circuit board vs when asked to specifically search and count the number of resistors.

C.5 Annotation. To annotate the raw data and extract segments, we utilized a labeling tool developed in PyQt5

**Figure 19.** Graphical interface of annotation tool. The user interface consists of the RGB video preview with sound and the gaze point overlaid on top shown in green. The interface allows adjusting the start and end time of each recording as well as annotating metadata like content-type, content-length and medium.

C.6 Demographics. We posted flyers in various locations to inform interested people about the study. Part…

**Figure 20.** Demographic statistics of the Columbus subset. An overview of the demographic distribution is shown: (i) Age Range, (ii) Gender, (iii) Native vs Non-native Speakers, (iv) Visual Aid Requirements, and (v) Education.

Pre-session Preparation. The process began with participants filling out a demographic questionnaire in a controlled indoor environment. Following this, an attendant explained how to calibrate t…

**Figure 21.** Example metadata for Columbus subset. The metadata contained is slightly different than that of the Seattle subset, allowing for different directions to explore for further research.

C.8.3 Post-processing. After the sessions, we generate draft previews of each session for segmentation. The previews are low quality RGB videos of the session with the gaze overlaid on the videos. We get the approximate segmen…

**Figure 22.** PR Curves for different modalities. We compare the PR curves of different modalities. (a) shows the curves for individual modalities and (b) compares how combining different modalities influence performance.

**Figure 23.** Breakdown by content length and content type in Columbus subset. The figures show the Accuracy and F1 scores respectively of different combinations of content length (Paragraph vs Short text) and content type (Text only vs Includes non-text), across different modalities. We show that reading detection works better on paragraphs and text-only cases.

**Figure 24.** Breakdown by medium and content type in Columbus subset. The figures show the Accuracy and F1 scores respectively of different combinations of medium (Print vs Digital vs Object) and content type (Text only vs Includes non-text), across different modalities. We show that reading detection works better on print and digital media.

**Figure 25.** RGB fails but Gaze succeeds. Failure case across 6 frames where RGB Fails but Gaze works. Notice that the RGB crop indicated in red has partial coverage of the reading material.

**Figure 26.** Gaze fails but RGB succeeds. Here, the subject is reading a map where text is irregularly placed across the field of view, making gaze patterns more sporadic.

**Figure 27.** Individual modality fails, combined modality succeeds. Here, both gaze and RGB fail individually but succeed when combined.

**Figure 28.** Misleading RGB: Gaze measurement error. Failure case across 6 frames where Gaze succeeds but RGB Only and Gaze with RGB fail. Note that here the eye gaze is offset due to measurement error, putting the RGB crop outside the reading material.

**Figure 29.** Misleading RGB: Partial coverage. Failure case across 6 frames where Gaze succeeds but RGB Only and Gaze with RGB fail. Notice that the RGB crop indicated in red has partial coverage of the reading material.

**Figure 30.** All Modality Failure case: Searching on Map. Failure case across 6 frames where all modalities and their combinations fail. Here the participant is searching for a particular name on map.

**Figure 31.** All Modality Failure case: Reading room numbers while walking. Failure case across 6 frames where all modalities and their combinations fail. Note that here the subject is asked to read room numbers while walking. The gaze pattern scans across the corridor before reading the room number. Note that the model was not trained on this kind of scenario.

**Figure 33.** Not Reading.

Original abstract

To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New task and 100-hour multimodal dataset for reading recognition in egocentric video, but the abstract gives no numbers or ablations to support the complementarity claims.

The main point here is a new task for recognizing when someone is reading using egocentric data from wearables, supported by a fresh 100-hour dataset. The abstract stops short of showing any results, which makes it tough to judge how well the approach works.

What the paper does is introduce reading recognition as a distinct problem for always-on smart glasses. They gathered 100 hours of video in varied real-world conditions, covering both reading and non-reading cases. Three modalities stand out: the RGB view from the camera, eye gaze tracking, and head pose. A transformer model is set up to use these inputs flexibly, alone or fused together. They also explore classifying reading types at larger scale than before.

This has value in pushing activity recognition toward more contextual uses in personal AI. Collecting data at that volume in unconstrained settings is a solid effort, and thinking about efficient encoding for each modality shows practical consideration for deployment.

The weak part is the lack of evidence for the main claims. The abstract says the modalities are relevant and complementary, but without ablations, accuracy numbers, or error breakdowns, there's no way to confirm it. The idea that the dataset represents real-world diversity enough for good generalization is stated but not backed by any evaluation details in what we have. That makes the soundness hard to assess from the abstract.

This kind of paper is useful for folks in egocentric vision and AR research who want benchmarks for reading-related interactions. It could help with building better contextual awareness in devices.

I would send this to peer review. The dataset and task definition seem worth community input, particularly on evaluation methods, even though the current writeup needs the results section to make a stronger case.

Referee Report

2 major / 1 minor

Summary. The paper introduces the task of reading recognition in egocentric videos for always-on smart glasses, presents the Reading in the Wild dataset comprising 100 hours of multimodal reading and non-reading videos collected in diverse realistic scenarios, identifies three modalities (egocentric RGB, eye gaze, and head pose), proposes a flexible transformer model that processes these modalities individually or in combination, claims to demonstrate that the modalities are relevant and complementary, investigates efficient encoding strategies for each, and shows the dataset's utility for classifying types of reading at larger scale and realism than prior constrained studies.

Significance. If the unreported experiments confirm modality complementarity and dataset utility with appropriate ablations and generalization tests, the work could provide a valuable large-scale benchmark for multimodal egocentric vision and advance contextual AI applications in wearable devices by extending reading understanding beyond lab settings.

major comments (2)

[Abstract] The claim that 'these modalities are relevant and complementary to the task' is load-bearing for the central contribution yet is asserted without any quantitative results, ablation studies, fusion performance deltas, error analysis, or evaluation metrics, making it impossible to verify whether the data supports the assertion.
[Abstract] The load-bearing assumption that the 100 hours of videos 'in diverse and realistic scenarios' are representative enough to train and evaluate models that generalize is stated without details on collection protocol, annotation process, diversity quantification, or train/test splits, preventing assessment of potential biases or leakage.

minor comments (1)

[Abstract] The description of the transformer as 'flexible' and the investigation of 'how to efficiently and effectively encode each modality' would benefit from at least high-level architectural or encoding details even in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address each of the major comments point by point below.

Referee: [Abstract] The claim that 'these modalities are relevant and complementary to the task' is load-bearing for the central contribution yet is asserted without any quantitative results, ablation studies, fusion performance deltas, error analysis, or evaluation metrics, making it impossible to verify whether the data supports the assertion.

Authors: We agree that the abstract, being a concise summary, does not include specific quantitative evidence. The full manuscript contains sections with ablation studies, performance comparisons for individual and combined modalities, and metrics that support the relevance and complementarity of egocentric RGB, eye gaze, and head pose. To improve clarity, we will revise the abstract to include a brief statement referencing these experimental findings.

Revision: yes

Referee: [Abstract] The load-bearing assumption that the 100 hours of videos 'in diverse and realistic scenarios' are representative enough to train and evaluate models that generalize is stated without details on collection protocol, annotation process, diversity quantification, or train/test splits, preventing assessment of potential biases or leakage.

Authors: The abstract summarizes the key aspects of the dataset. Detailed information on the collection protocol, annotation process, measures of diversity, and the train/test splits (designed to prevent data leakage) is provided in the main body of the paper. We acknowledge that incorporating a short reference to these elements in the abstract would help address concerns about generalization and potential biases. We will make this revision.

Revision: yes

Circularity Check

0 steps flagged

No circularity: new dataset and empirical claims with no derivation chain present

Only the abstract is available, which introduces a new task, a 100-hour multimodal dataset, and a flexible transformer using RGB, gaze, and pose modalities. No equations, fitted parameters, self-citations, or derivations appear that could reduce a claimed result to its own inputs by construction. The contribution is framed as data collection plus empirical model application, which is self-contained against external benchmarks and contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only information provides no identifiable free parameters, axioms, or invented entities; the work appears to rely on standard multimodal learning techniques applied to a new dataset.

pith-pipeline@v0.9.0 · 5713 in / 1039 out tokens · 52012 ms · 2026-05-19T12:30:19.929143+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We propose a flexible multimodal transformer model that takes in different modalities as input... three layers of 1D (gaze and IMU) and 2D (RGB) convolutions... modality dropout... 137k parameters.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear

Paper passage: We identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task... investigate how to efficiently and effectively encode each modality.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

[1]

Towards predicting reading comprehension from gaze behavior

Seoyoung Ahn, Conor Kelton, Aruna Balasubramanian, and Greg Zelinsky. Towards predicting reading comprehension from gaze behavior. InETRA, 2020

work page 2020
[2]

Whisperx: Time-accurate speech transcrip- tion of long-form audio.INTERSPEECH, 2023

Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. Whisperx: Time-accurate speech transcrip- tion of long-form audio.INTERSPEECH, 2023

work page 2023
[3]

A robust realtime reading-skimming classifier

Ralf Biedert, Jörn Hees, Andreas Dengel, and Georg Buscher. A robust realtime reading-skimming classifier. InETRA, 2012

work page 2012
[4]

Robust recognition of reading activ- ity in transit using wearable electrooculography

Andreas Bulling, Jamie A Ward, Hans Gellersen, and Gerhard Tröster. Robust recognition of reading activ- ity in transit using wearable electrooculography. InPervasive Computing: 6th International Conference, Pervasive 2008 Sydney, Australia, May 19-22, 2008 Proceedings 6, pages 19–37. Springer, 2008

work page 2008
[5]

Visual attentional training improves reading capabilities in children with dyslexia: An eye tracker study during a reading task.Brain sciences, 10(8), 2020

Simona Caldani, Christophe-Loïc Gerard, Hugo Peyre, and Maria Pia Bucci. Visual attentional training improves reading capabilities in children with dyslexia: An eye tracker study during a reading task.Brain sciences, 10(8), 2020

work page 2020
[6]

A robust algorithm for reading detection

Christopher S Campbell and Paul P Maglio. A robust algorithm for reading detection. InPUI, 2001

work page 2001
[7]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

work page 2017
[8]

Improving the understanding of web user behaviors through machine learning analysis of eye-tracking data.User Modeling and User-Adapted Interaction, 34(2), 2024

Diana Castilla, Omar Del Tejo Catalá, Patricia Pons, François Signol, Beatriz Rey, Carlos Suso-Ribera, and Juan-Carlos Perez-Cortes. Improving the understanding of web user behaviors through machine learning analysis of eye-tracking data.User Modeling and User-Adapted Interaction, 34(2), 2024

work page 2024
[9]

Gazexplain: Learning to predict natural language explanations of visual scanpaths

Xianyu Chen, Ming Jiang, and Qi Zhao. Gazexplain: Learning to predict natural language explanations of visual scanpaths. InECCV, 2024

work page 2024
[10]

Predicting reading comprehension scores from eye movements using artificial neural networks and fuzzy output error.Artif

Leana Copeland, Tom Gedeon, and B Sumudu U Mendis. Predicting reading comprehension scores from eye movements using artificial neural networks and fuzzy output error.Artif. Intell. Res., 3(3), 2014

work page 2014
[11]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InECCV, 2018

work page 2018
[12]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasad So...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Oat: Object-level attention transformer for gaze scanpath prediction

Yini Fang, Jingling Yu, Haozheng Zhang, Ralf van der Lans, and Bertram Shi. Oat: Object-level attention transformer for gaze scanpath prediction. InECCV, 2024

work page 2024
[14]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Gird- har, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

work page 2022
[16]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InCVPR, 2024

work page 2024
[17]

Zuco, a simultaneous eeg and eye-tracking resource for natural sentence reading.Scientific data, 5(1), 2018

Nora Hollenstein, Jonathan Rotsztejn, Marius Troendle, Andreas Pedroni, Ce Zhang, and Nicolas Langer. Zuco, a simultaneous eeg and eye-tracking resource for natural sentence reading.Scientific data, 5(1), 2018

work page 2018
[18]

Involution fused convnet for classifying eye- tracking patterns of children with autism spectrum disorder.Engineering Applications of Artificial Intelligence, 2025

Md Farhadul Islam, Meem Arafat Manab, Joyanta Jyoti Mondal, Sarah Zabeen, Fardin Bin Rahman, Md Zahidul Hasan, Farig Sadeque, and Jannatun Noor. Involution fused convnet for classifying eye- tracking patterns of children with autism spectrum disorder.Engineering Applications of Artificial Intelligence, 2025

work page 2025
[19]

Icdar 2024 competition on reading documents through aria glasses

Soumya Shamarao Jahagirdar, Ajoy Mondal, Yuheng Ren, Omkar M Parkhi, and CV Jawahar. Icdar 2024 competition on reading documents through aria glasses. InICDAR, 2024

work page 2024
[20]

Boosting gaze object prediction via pixel-level supervision from vision foundation model

Yang Jin, Lei Zhang, Shi Yan, Bin Fan, and Binglu Wang. Boosting gaze object prediction via pixel-level supervision from vision foundation model. InECCV, 2024

work page 2024
[21]

A theory of reading: from eye fixations to comprehension

Marcel A Just and Patricia A Carpenter. A theory of reading: from eye fixations to comprehension. Psychological review, 87(4), 1980

work page 1980
[22] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019
[23] Conor Kelton, Zijun Wei, Seoyoung Ahn, Aruna Balasubramanian, Samir R Das, Dimitris Samaras, and Gregory Zelinsky. Reading detection in real-time. In ETRA, 2019
[24] Yunsoo Kim, Jinge Wu, Yusuf Abdulle, Yue Gao, and Honghan Wu. Enhancing human-computer interaction in chest x-ray analysis using vision and language model with eye gaze patterns. In MICCAI, 2024
[25] Yan Kong, Sheng Wang, Jiangdong Cai, Zihao Zhao, Zhenrong Shen, Yonghao Li, Manman Fei, and Qian Wang. Gaze-detr: Using expert gaze to reduce false positives in vulvovaginal candidiasis screening. In MICCAI, 2024
[26] Robert Konrad, Nitish Padmanaban, J Gabriel Buckmaster, Kevin C Boyle, and Gordon Wetzstein. Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear. arXiv preprint arXiv:2401.17217, 2024
[27] Kai Kunze, Yuzuko Utsumi, Yuki Shiga, Koichi Kise, and Andreas Bulling. I know what you are reading: recognition of document types using mobile eye tracking. In ISWC, 2013
[28] Manuel Landsmann, Olivier Augereau, and Koichi Kise. Classification of reading and not reading behavior based on eye movement analysis. In ISWC, 2019
[29] Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, 2018
[30] Wen-Hung Liao, Chin-Wen Chang, and Yi-Chieh Wu. Classification of reading patterns based on gaze information. In 2017 IEEE International Symposium on Multimedia (ISM). IEEE, 2017
[31] Zhi-Yi Lin, Jouh Yeong Chew, Jan van Gemert, and Xucong Zhang. Gazehta: End-to-end gaze target detection with head-target association. arXiv preprint arXiv:2404.10718, 2024
[32] Shaonan Liu, Wenting Chen, Jie Liu, Xiaoling Luo, and Linlin Shen. Gem: Context-aware gaze estimation with visual search behavior matching for chest radiograph. In MICCAI, 2024
[33] Diane C Mézière, Lili Yu, Erik D Reichle, Titus von der Malsburg, and Genevieve McArthur. Using eye-tracking measures to predict reading comprehension. Reading Research Quarterly, 58(3), 2023
[34] Kyle Min and Jason J Corso. Integrating human gaze into attention for egocentric activity recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1069–1078, 2021
[35] Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai. Look hear: Gaze prediction for speech-directed human attention. In ECCV, 2024
[36] Robert O’Shea. Thumb’s rule tested: Visual angle of thumb’s width is about 2 deg. Perception, 20, 1991
[37] Süleyman Özdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, and Enkelejda Kasneci. A transformer-based model for the prediction of human gaze behavior on videos. In ETRA, 2024
[38] Nikhil Raina, Guruprasad Somasundaram, Kang Zheng, Sagar Miglani, Steve Saarinen, Jeff Meissner, Mark Schwesinger, Luis Pesqueira, Ishita Prasad, Edward Miller, Prince Gupta, Mingfei Yan, Richard Newcombe, Carl Ren, and Omkar M Parkhi. Egoblur: Responsible innovation in aria, 2023
[39] Karolina Rataj, Anna Przekoracka-Krawczyk, and Rob HJ Van der Lubbe. On understanding creative language: the late positive complex and novel metaphor comprehension. Brain Research, 1678, 2018
[40] Florian Strohm, Mihai Bâce, and Andreas Bulling. Learning user embeddings from human gaze for personalised saliency prediction. In ETRA, 2024
[41] Enkeleda Thaqi, Mohamed Omar Mantawy, and Enkelejda Kasneci. Sara: Smart ai reading assistant for reading comprehension. In ETRA, 2024
[42] Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, and Jingya Wang. Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024
[43] Ru Wang, Zach Potter, Yun Ho, Daniel Killough, Linxiu Zeng, Sanbrita Mondal, and Yuhang Zhao. Gazeprompt: Enhancing low vision people’s reading experience with gaze-aware augmentations. In CHI Conference on Human Factors in Computing Systems, 2024
[44] Shaoxuan Wu, Xiao Zhang, Bin Wang, Zhuo Jin, Hansheng Li, and Jun Feng. Gaze-directed vision gnn for mitigating shortcut learning in medical image. In MICCAI, 2024
[45] Keyi Yu, Yang Liu, Alexander G Schwing, and Jian Peng. Fast and accurate text classification: Skimming, rereading and early stopping. In ICLR, 2018
[46] Francesca Zermiani, Prajit Dhar, Ekta Sood, Fabian Kögel, Andreas Bulling, and Maria Wirzberger. Interead: An eye tracking dataset of interrupted reading. In LREC-COLING, 2024
[47] Zehua Zhang, David Crandall, Michael Proulx, Sachin Talathi, and Abhishek Sharma. Can gaze inform egocentric action recognition? In ETRA, 2022
[48] Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In CVPR, 2023

Reading Recognition in the Wild: Supplementary Material

A Introduction

Additional dataset details. Our dataset is the first instance of a reading activity recognition dataset in unconstrained environments and is also the ...
Table 13: Comparison of alternative methods. This table compares approaches for reading recognition, including (i) vision-language models (VLMs), (ii) action recognition models, and (iii) altern...

Enabled modalities
Gaze: ✗ | ✗ | ✓ | ✓
RGB: ✓ | ✓ | ✗ | ✓
IMU: ✗ | ✗ | ✗ | ✓
Fusion: ✓ | ✗ | ✗ | ✓

On-device Feasibility: ✗ | ✗ | ✓ | ✓
Number of parameters: 11B | 25M | 1k | 130k
Sensing cost (power): high | high | low | low
RGB requirements: full RGB | full RGB video | - | foveated patch (5° FoV)
  (dominates sensing cost) (optional)
Real-time: ✗ | ✗ | ✓ | ✓
Inference time (ms): 567.410 | 895.511 (incl. flow) | 0.310 | 0.545

Performance
Zero-shot capability: ✓ | ✗ | ✗ | ✓
Acc / F1 on RiTW Columbus: 76.7 / 65.6 | - | - | 82.9 / 88.8
Acc / F1 on EGTEA dataset: 89.6 / 61.5 | 88.8 / 65.8 | 85.8 / 62.8 | 89.6 / 70.6
