Predicting video saliency using crowdsourced mouse-tracking data
Pith reviewed 2026-05-25 12:23 UTC · model grok-4.3
The pith
Crowdsourced mouse-tracking data collected through a cursor-contingent viewing system can approximate eye-tracking data for video saliency maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We designed a mouse-contingent video viewing system which simulates the viewers' peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded from an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye-tracker. We developed a crowdsourcing system that enables the collection of such mouse-tracking data at large scale. Using the collected mouse-tracking data we showed that it can serve as an approximation of eye-tracking data. Moreover, trying to increase the efficiency of collected mouse-tracking data we proposed a novel deep neural network algorithm that improves the quality of mouse-tracking saliency maps.
What carries the argument
The mouse-contingent video viewing system that simulates peripheral vision from mouse cursor position, turning mouse movements into a proxy for gaze fixations used to build saliency maps.
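One way to picture how cursor samples become a saliency map is the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the function name `cursor_saliency_map`, the Gaussian width `sigma`, and the sample coordinates are all hypothetical, and placing a Gaussian around each sample is simply the conventional way fixation points are turned into a smooth map.

```python
import math

def cursor_saliency_map(points, width, height, sigma=1.5):
    """Accumulate an isotropic Gaussian around each (x, y) cursor sample.

    Hypothetical helper: the kernel shape and sigma are assumptions.
    """
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total == 0:
        return grid
    # Normalise so the map sums to 1 and can be read as a distribution.
    return [[v / total for v in row] for row in grid]

# Made-up cursor positions pooled from several viewers of one frame.
samples = [(2, 2), (2, 3), (6, 1)]
saliency = cursor_saliency_map(samples, width=8, height=5)
```

Pooling samples from many viewers before smoothing is what would let a crowdsourced collection stand in for a fixation map; how many viewers are needed is an empirical question the sketch does not address.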
If this is right
- Mouse-tracking data gathered via crowdsourcing serves as a scalable, low-cost approximation to eye-tracking data for video saliency.
- A dedicated deep neural network can measurably raise the quality of saliency maps derived from mouse-tracking inputs.
- Large-scale video saliency datasets become feasible to collect without eye-tracking hardware.
- Saliency prediction models can be trained on substantially bigger and more varied video sets assembled this way.
Where Pith is reading between the lines
- The same cursor-contingent approach could be adapted to collect attention data for other dynamic visual tasks where eye-trackers are impractical.
- The DNN refinement step implies that mouse data contains learnable, systematic deviations from true gaze that can be corrected algorithmically.
- Performance of the approximation may vary with video content type, suggesting targeted validation on fast-motion or low-contrast scenes.
- Hybrid training that mixes mouse-derived maps with smaller eye-tracking sets might improve model generalization beyond either data source alone.
Load-bearing premise
The mouse-contingent viewing system accurately simulates viewers' peripheral vision based on the mouse cursor position, so mouse movements reliably stand in for actual gaze fixations.
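The premise can be made concrete with a small sketch of a cursor-contingent frame: sharp near the cursor, blurred elsewhere. Everything here is an assumption made for illustration, including the box filter, the Gaussian falloff, the `sigma` value, and the names `box_blur` and `cursor_contingent_frame`; the summary above does not specify how the original player renders its blur.

```python
import math

def box_blur(frame, radius=1):
    """Blur a grayscale frame (list of rows) with a clamped box filter."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [frame[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def cursor_contingent_frame(frame, cursor_x, cursor_y, sigma=2.0):
    """Blend the sharp frame with a blurred copy by distance from the cursor."""
    blurred = box_blur(frame)
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d2 = (x - cursor_x) ** 2 + (y - cursor_y) ** 2
            # alpha is 1 at the cursor and decays towards 0 far from it.
            alpha = math.exp(-d2 / (2.0 * sigma ** 2))
            out[y][x] = alpha * frame[y][x] + (1.0 - alpha) * blurred[y][x]
    return out

# Checkerboard frame: fine detail that blurring visibly destroys.
frame = [[float((x + y) % 2) for x in range(12)] for y in range(6)]
shown = cursor_contingent_frame(frame, cursor_x=2, cursor_y=2)
```

Because detail is only available near the cursor, a viewer has to move the mouse to see it, which is the mechanism the premise relies on. Whether that movement tracks real gaze is exactly what the sketch cannot show.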
What would settle it
Side-by-side quantitative comparison of saliency maps produced from the crowdsourced mouse data against maps from simultaneous eye-tracking recordings on identical videos, checking agreement in fixation locations and saliency values.
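A comparison of this kind is usually scored with standard saliency metrics. The sketch below computes two of them, the linear correlation coefficient (CC) between maps and the normalized scanpath saliency (NSS) at fixated pixels. The 3x3 maps and the fixation list are made-up toy values, not data from the paper, and the helper names are hypothetical.

```python
import math

def flatten(m):
    return [v for row in m for v in row]

def pearson_cc(map_a, map_b):
    """Linear correlation coefficient (CC) between two saliency maps."""
    a, b = flatten(map_a), flatten(map_b)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def nss(saliency_map, fixations):
    """Mean z-scored map value at the fixated (x, y) pixels."""
    vals = flatten(saliency_map)
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    z = [(saliency_map[y][x] - mean) / std for x, y in fixations]
    return sum(z) / len(z)

# Toy values for illustration only.
mouse_map = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
eye_map = [[0.0, 0.2, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.0]]
eye_fixations = [(1, 1), (1, 0)]  # (x, y) gaze fixations on the same frame

cc = pearson_cc(mouse_map, eye_map)
score = nss(mouse_map, eye_fixations)
```

CC rewards overall agreement in shape, while NSS asks whether the mouse-derived map is high where people actually fixated, so the two can disagree; reporting both on identical videos would address each half of the test described above.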
Original abstract
This paper presents a new way of getting high-quality saliency maps for video, using a cheaper alternative to eye-tracking data. We designed a mouse-contingent video viewing system which simulates the viewers' peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded from an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye-tracker. We developed a crowdsourcing system that enables the collection of such mouse-tracking data at large scale. Using the collected mouse-tracking data we showed that it can serve as an approximation of eye-tracking data. Moreover, trying to increase the efficiency of collected mouse-tracking data we proposed a novel deep neural network algorithm that improves the quality of mouse-tracking saliency maps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a mouse-contingent video viewing system that applies peripheral blur based on mouse cursor position to enable collection of crowdsourced mouse-tracking data as a low-cost proxy for eye-tracking saliency maps on videos. It asserts that the collected mouse data approximates eye-tracking data and proposes a novel DNN to improve the quality of the resulting saliency maps.
Significance. If the mouse-to-eye approximation holds with strong quantitative support, the work would be significant for computer vision by enabling scalable, low-cost collection of video saliency data via crowdsourcing, which could expand training sets for saliency prediction models. The crowdsourcing platform itself represents a practical engineering contribution.
Major comments (2)
- [Abstract] The central claim that 'mouse-tracking data ... can serve as an approximation of eye-tracking data' is load-bearing yet unsupported by any reported quantitative metrics (AUC, NSS, KL divergence, or correlation) or direct comparison on identical stimuli; the description supplies no validation details or baselines.
- [Abstract] The mouse-contingent system is presented as simulating peripheral vision, but the manuscript provides no evidence that cursor-based blur replicates saccadic dynamics, covert attention, or natural gaze trajectories; this untested fidelity is required for the proxy claim to hold.
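The request for baselines can be illustrated with a KL-divergence check against a chance-level map. This is a sketch under assumptions: the toy maps are invented, the uniform map is only one possible baseline (a center-prior map is another common choice), and the helper names and the `eps` smoothing constant are hypothetical.

```python
import math

def normalize(m, eps=1e-12):
    """Flatten a non-negative map and scale it to sum to 1."""
    flat = [max(v, 0.0) + eps for row in m for v in row]
    total = sum(flat)
    return [v / total for v in flat]

def kl_divergence(reference_map, candidate_map):
    """KL(reference || candidate); 0 means identical distributions."""
    p, q = normalize(reference_map), normalize(candidate_map)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy values for illustration only.
eye_map = [[0.0, 0.2, 0.0], [0.1, 0.5, 0.1], [0.0, 0.1, 0.0]]
mouse_map = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]
uniform_map = [[1.0] * 3 for _ in range(3)]  # chance-level baseline

kl_mouse = kl_divergence(eye_map, mouse_map)
kl_uniform = kl_divergence(eye_map, uniform_map)
```

A proxy is only informative if its divergence from the eye-tracking map is clearly below that of such a baseline; a small absolute KL value on its own says little.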
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We agree that the abstract should better support the central claims with quantitative details from the manuscript and will revise it accordingly. We address each point below.
- Referee: [Abstract] The central claim that 'mouse-tracking data ... can serve as an approximation of eye-tracking data' is load-bearing yet unsupported by any reported quantitative metrics (AUC, NSS, KL divergence, or correlation) or direct comparison on identical stimuli; the description supplies no validation details or baselines.
  Authors: The abstract is a concise summary; the manuscript reports direct comparisons on identical stimuli with quantitative metrics (AUC, NSS, KL divergence, and correlation) in the experimental results and figures. To address the concern, we will revise the abstract to include key validation metrics and baselines.
  Revision: yes
- Referee: [Abstract] The mouse-contingent system is presented as simulating peripheral vision, but the manuscript provides no evidence that cursor-based blur replicates saccadic dynamics, covert attention, or natural gaze trajectories; this untested fidelity is required for the proxy claim to hold.
  Authors: The system applies cursor-based peripheral blur to enable scalable data collection, with proxy validity shown empirically via saliency map approximation rather than exact replication of saccades or covert attention. We will revise the abstract to clarify the system's design scope and empirical support without overstating fidelity.
  Revision: partial
Circularity Check
No circularity: empirical data collection with no derivation chain
The paper's core contribution is an empirical crowdsourcing pipeline for mouse-tracking saliency data plus a DNN post-processing step; the abstract and description contain no equations, fitted parameters, or mathematical derivations. Claims rest on direct collection and comparison to eye-tracking, which are externally falsifiable and do not reduce to self-definition or self-citation. No load-bearing uniqueness theorems, ansatzes, or renamed known results appear. This is the normal non-circular case for an applied data-collection study.
Reference graph
Works this paper leans on
-
[1]
Predicting video saliency using crowdsourced mouse-tracking data
Introduction. When watching videos, humans distribute their attention unevenly. Some objects in the video may attract more attention than the others. This distribution can be represented by per-frame saliency maps defining the importance of each frame region for viewers. The use of saliency can improve the quality of many video processing applicatio...
-
[2]
Hereafter we provide a brief overview of these topics
Related work. The paper makes a contribution to two topics: cursor-based alternatives to eye tracking and semiautomatic saliency modeling. Hereafter we provide a brief overview of these topics. Cursor-based alternatives to eye tracking. There were many efforts to use mouse tracking as a cheap alternative to eye tracking. However, most of these efforts were...
-
[3]
We show a participant the video in a special video player in real-time in full-screen mode
Cursor-based saliency for video. We propose a methodology for high-quality visual-attention estimation based on mouse-tracking data and a system collecting such data using crowdsourcing platforms. We show a participant the video in a special video player in real-time in full-screen mode.
[Figure labels: Input frames, Dilated ResNet, Conv LSTM, Conv 1x1, Spatial features, Te...]
-
[4]
Semiautomatic deep neural network. To improve saliency maps generated using the cursor positions as eye fixations we developed a new neural network algorithm. The algorithm is based on SAM [11] architecture which was originally designed to predict saliency of static images. Though SAM is a static model, its retrained ResNet version can outperform the ...
-
[5]
Experiments. We used our cursor-based saliency system to collect mouse-movement data in 12 random videos from Hollywood-2 video saliency dataset [7] that are each 20–30 seconds long. We hired participants on Subjectify.us crowdsourcing platform, showed them 10 videos and paid them $0.15 if they watched all videos. In total, we collected data of 30 part...
-
[6]
Conclusion. In this paper, we proposed a cheap way of getting high-quality saliency maps for video through the use of additional data. We developed a novel system that shows viewers videos in a mouse-contingent video player and collects mouse-tracking data approximating real eye fixations. We showed that mouse-tracking data can be used as an alternative...
-
[7]
Acknowledgments. This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785 a
- [8]
-
[9]
T. Lu, Z. Yuan, Y. Huang, D. Wu, and H. Yu. Video retargeting with nonlinear spatial-temporal saliency fusion. In 2010 IEEE International Conference on Image Processing, pages 1801–1804, 2010
-
[10]
A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013
-
[11]
Saliency prediction in the deep learning era: An empirical investigation, 2018
Ali Borji. Saliency prediction in the deep learning era: An empirical investigation, 2018
-
[12]
A semiautomatic saliency model and its application to video compression
Vitaliy Lyudvichenko, Mikhail Erofeev, Yury Gitman, and Dmitriy Vatolin. A semiautomatic saliency model and its application to video compression. In 13th IEEE International Conference on Intelligent Computer Communication and Processing, pages 403–410, 2017
-
[13]
Improving video compression with deep visual-attention models
Vitaliy Lyudvichenko, Mikhail Erofeev, Alexander Ploshkin, and Dmitriy Vatolin. Improving video compression with deep visual-attention models. In 2019 International Conference on Intelligent Medicine and Image Processing, 2019
-
[14]
S. Mathe and C. Sminchisescu. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1408–1424, 2015
-
[15]
Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks
Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In 2015 International Conference on Computer Vision, pages 262–270, 2015
-
[16]
Nam Wook Kim, Zoya Bylinskii, Michelle A. Borkin, Krzysztof Z. Gajos, Aude Oliva, Fredo Durand, and Hanspeter Pfister. Bubbleview: An interface for crowdsourcing image importance maps and tracking visual attention. ACM Trans. Comput.-Hum. Interact., 24(5):36:1–36:40, 2017
-
[17]
A benchmark of computational models of saliency to predict human fixations
Tilke Judd, Frédo Durand, and Antonio Torralba. A benchmark of computational models of saliency to predict human fixations. Technical report, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 2012
-
[18]
Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model
Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. IEEE Transactions on Image Processing, 27(10):5142–5154, 2018
-
[19]
Spatio-temporal modeling and prediction of visual attention in graphical user interfaces
Pingmei Xu, Yusuke Sugano, and Andreas Bulling. Spatio-temporal modeling and prediction of visual attention in graphical user interfaces. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 3299–3310, 2016
-
[20]
Are all the frames equally important? CoRR, abs/1905.07984, 2019
Oleksii Sidorov, Marius Pedersen, Nam Wook Kim, and Sumit Shekhar. Are all the frames equally important? CoRR, abs/1905.07984, 2019
-
[21]
Revisiting video saliency: A large-scale benchmark and a new model
Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, and Ali Borji. Revisiting video saliency: A large-scale benchmark and a new model. 2018
-
[22]
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM
Lai Jiang, Mai Xu, and Zulin Wang. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM. CoRR, abs/1709.06316, 2017
-
[23]
Learning to predict where humans look
Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In International Conference on Computer Vision (ICCV), pages 2106–2113, 2009.
About the authors. Vitaliy Lyudvichenko is a Ph.D. student of Computer Graphics and Media Lab of Computer Science department of Lomonosov Moscow State University. Dmitr...