Following the Eye-Tracking Evidence: Established Web-Search Assumptions Fail in Carousel Interfaces
Pith reviewed 2026-05-09 23:03 UTC · model grok-4.3
The pith
Eye-tracking data from carousel interfaces shows that web-search assumptions about scanning patterns and click behavior do not hold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The analysis of eye-tracking recordings reveals that the F-pattern applies only to vertical examination and not to horizontal swiping; conditioned on a click, examination traces an L-pattern specific to carousels; the examination hypothesis fails to predict which items receive clicks; and users ignore carousel headings while focusing immediately on the displayed items. These patterns contradict the assumptions imported from single-list web search.
What carries the argument
Comparison of gaze sequences and click logs recorded during controlled carousel browsing against the F-pattern and examination hypothesis imported from web-search studies.
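A minimal Python sketch of how such a comparison could be set up. It aggregates the examined carousel cells into a row-by-column frequency grid, optionally restricted to sessions that contain a click. The session format (dicts with `examined` and `click` keys) and the function name `examination_grid` are assumptions made for illustration; they are not the dataset's actual schema or the authors' code.

```python
def examination_grid(sessions, n_rows, n_cols, clicked_only=False):
    """Fraction of sessions in which each (row, col) cell was examined.

    `sessions` is a hypothetical list of dicts with:
      - "examined": iterable of (row, col) cells the user looked at
      - "click": the clicked (row, col) cell, or None if nothing was clicked
    """
    counts = [[0] * n_cols for _ in range(n_rows)]
    used = 0
    for session in sessions:
        if clicked_only and session["click"] is None:
            continue  # keep only sessions that contain a click
        used += 1
        # set() so a cell revisited within one session is counted once
        for row, col in set(session["examined"]):
            counts[row][col] += 1
    if used == 0:
        return [[0.0] * n_cols for _ in range(n_rows)]
    return [[cell / used for cell in row] for row in counts]


# Toy, made-up sessions on a 2-row by 3-column carousel page.
toy_sessions = [
    {"examined": [(0, 0), (0, 1), (1, 0)], "click": (1, 0)},
    {"examined": [(0, 0), (1, 0)], "click": (1, 0)},
    {"examined": [(0, 0), (0, 1), (0, 2)], "click": None},
]

print(examination_grid(toy_sessions, 2, 3, clicked_only=True))
# [[1.0, 0.5, 0.0], [1.0, 0.0, 0.0]]
```

Comparing the grid for all sessions with the grid for click sessions only is one way to see whether the gaze shape changes when conditioning on a click.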
If this is right
- Click models that assume examination precedes and causes clicks must be rebuilt for carousel settings.
- Offline evaluation metrics that embed position bias from web-search lists will misrank items in carousels.
- Important information placed in carousel headings is likely to be overlooked by users.
- Behavioral models for recommendation need separate parameters for vertical and horizontal examination.
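For context, the examination hypothesis in its usual click-model form says that an item is clicked only if it is examined and found attractive, so P(click) = P(examined) × P(attractive). Under that reading, the click-through rate among examined items should reflect attractiveness rather than position. The sketch below computes that conditional rate per position. The `(position, examined, clicked)` record layout and the name `ctr_given_examination` are assumptions for illustration, and the log is made up.

```python
from collections import defaultdict


def ctr_given_examination(records):
    """Return {position: clicks / examinations}, using examined impressions only.

    Each record is a hypothetical (position, examined, clicked) tuple.
    """
    examined = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_examined, was_clicked in records:
        if not was_examined:
            continue  # condition on examination
        examined[position] += 1
        if was_clicked:
            clicked[position] += 1
    return {pos: clicked[pos] / examined[pos] for pos in sorted(examined)}


# Toy, made-up log: position 1 has 4 examined impressions and 1 click;
# position 2 has 4 examined impressions and 2 clicks (one impression unexamined).
toy_log = [
    (1, True, True), (1, True, False), (1, True, False), (1, True, False),
    (2, True, True), (2, True, True), (2, True, False), (2, True, False),
    (2, False, False),
]

print(ctr_given_examination(toy_log))
# {1: 0.25, 2: 0.5}
```

A systematic dependence of this conditional rate on position, with item quality held comparable, is the kind of evidence that would count against the hypothesis.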
Where Pith is reading between the lines
- Designers could test whether separating vertical and horizontal interaction logs improves personalization accuracy in live recommendation systems.
- The L-pattern finding raises the question of whether similar gaze shapes appear in other swipe-based mobile interfaces such as social feeds.
Load-bearing premise
The eye-tracking dataset accurately records representative examination and clicking behavior without artifacts from the laboratory setup or participant pool that would create the observed L-pattern or hypothesis failures.
What would settle it
A new eye-tracking study on a different carousel interface that records the classic F-pattern across both directions or finds that examined items receive clicks at rates predicted by the examination hypothesis would contradict the central claims.
Original abstract
Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i) the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii) click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii) contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes a recently-released eye-tracking dataset collected on carousel interfaces to test transfer of web-search user behavior assumptions, including the F-pattern, examination hypothesis, and attention to headings. The authors report that the F-pattern holds only for vertical (not horizontal) examination, an L-pattern emerges when examination is conditioned on clicks, click-through rates conditioned on examination do not support the examination hypothesis, and users largely ignore headings in favor of content items. They conclude that these assumptions fail to transfer and that metrics and click models for carousels require re-evaluation.
Significance. If the empirical patterns are robust, the work is significant for information retrieval and recommender systems because carousel interfaces dominate streaming and recommendation platforms yet lack dedicated behavioral models. The eye-tracking methodology provides direct evidence of examination behavior beyond click logs, strengthening the case against direct transfer of web-search findings. This could motivate interface-specific click models and evaluation metrics.
major comments (3)
- [§4.2] §4.2 (Examination hypothesis analysis): The claim that CTR conditioned on examination rejects the examination hypothesis does not report the number of users, total examinations, or statistical tests (e.g., regression coefficients or p-values). Without these, it is impossible to assess whether the null result reflects a true interface difference or insufficient power, directly affecting the load-bearing conclusion that the hypothesis fails.
- [§4.1] §4.1 (L-pattern result): The L-pattern (examination conditioned on click) is presented as unique to carousels, but the section provides no operational definition of 'examination' (fixation duration threshold or AOI boundaries), no count of conditioned clicks, and no comparison to a within-study web-search baseline. These omissions make it difficult to rule out measurement artifacts or task differences as drivers of the reported pattern.
- [§3] §3 (Dataset and methods): Participant count, demographics, task instructions, exclusion criteria, and eye-tracking preprocessing steps (e.g., fixation detection parameters) are not detailed. Because all central claims rest on patterns extracted from this single dataset, missing methodological specifics prevent evaluation of whether study design choices artifactually produce the L-pattern or hypothesis failures.
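As an illustration of the kind of test the first comment asks for, here is a minimal sketch of a pooled two-proportion z-test that compares click-through rates among examined items at two positions. It is a simpler stand-in for a regression analysis, not the manuscript's method. It treats impressions as independent, which repeated measurements from the same participant would violate. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    clicks_* are click counts and n_* are examined-impression counts
    for two groups (for example, two carousel positions).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one examined impression")
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("pooled CTR is 0 or 1; the z statistic is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up counts: 30 clicks in 100 examinations versus 10 clicks in 100.
z, p = two_proportion_z_test(30, 100, 10, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Reporting the group sizes alongside the statistic also lets readers judge whether a non-significant difference could simply reflect low power.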
minor comments (2)
- [Abstract] The abstract refers to 'a recently-released' dataset without a citation; adding the reference in §3 or the introduction would improve traceability.
- [Figures] Figure captions for the L-pattern and F-pattern visualizations could explicitly state the conditioning (e.g., 'conditioned on click') and axis scales to aid interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below. Where the concerns identify gaps in reporting or clarity, we will revise the manuscript to incorporate the requested details and strengthen the presentation of our results.
- Referee: [§4.2] §4.2 (Examination hypothesis analysis): The claim that CTR conditioned on examination rejects the examination hypothesis does not report the number of users, total examinations, or statistical tests (e.g., regression coefficients or p-values). Without these, it is impossible to assess whether the null result reflects a true interface difference or insufficient power, directly affecting the load-bearing conclusion that the hypothesis fails.
  Authors: We agree that explicit reporting of sample sizes, examination counts, and statistical tests is necessary to allow readers to evaluate the strength of the evidence against the examination hypothesis. In the revised manuscript we will state the number of users and total examinations drawn from the public dataset, and we will add a statistical test (logistic regression or equivalent) with coefficients and p-values to quantify the relationship between examination and click probability. This will directly address concerns about statistical power and support the claim that the hypothesis does not transfer to carousel interfaces.
  Revision: yes
- Referee: [§4.1] §4.1 (L-pattern result): The L-pattern (examination conditioned on click) is presented as unique to carousels, but the section provides no operational definition of 'examination' (fixation duration threshold or AOI boundaries), no count of conditioned clicks, and no comparison to a within-study web-search baseline. These omissions make it difficult to rule out measurement artifacts or task differences as drivers of the reported pattern.
  Authors: We accept that the current version of §4.1 lacks an explicit operational definition and supporting counts. In revision we will define examination precisely (including fixation-duration threshold and AOI boundaries) and report the number of clicks on which the L-pattern analysis is conditioned. Because the study collected data exclusively on carousel interfaces, a within-study web-search baseline is unavailable; however, we will add a comparison to established F-pattern results from prior web-search eye-tracking literature to better substantiate the claim that the observed L-pattern is interface-specific rather than an artifact.
  Revision: yes
- Referee: [§3] §3 (Dataset and methods): Participant count, demographics, task instructions, exclusion criteria, and eye-tracking preprocessing steps (e.g., fixation detection parameters) are not detailed. Because all central claims rest on patterns extracted from this single dataset, missing methodological specifics prevent evaluation of whether study design choices artifactually produce the L-pattern or hypothesis failures.
  Authors: We acknowledge that the methods section is currently brief. Although the underlying dataset is publicly released and its original paper contains the full protocol, we agree that the present manuscript should be self-contained. In the revised version we will expand §3 to summarize participant count and demographics, task instructions, exclusion criteria, and key preprocessing parameters such as fixation-detection thresholds. This will enable readers to assess potential design artifacts without needing to consult the dataset paper.
  Revision: yes
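To make the requested operational definition concrete, the sketch below shows one possible rule: an item counts as examined when the total duration of fixations inside its area of interest (AOI) reaches a threshold. The rectangular AOIs, the `AOI` and `Fixation` types, and the 100 ms default are illustrative assumptions, not values taken from the dataset or the manuscript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AOI:
    """Hypothetical rectangular area of interest around one carousel item."""
    item_id: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Fixation:
    x: float
    y: float
    duration_ms: float


def examined_items(fixations, aois, min_dwell_ms=100.0):
    """Return ids of items whose summed fixation time meets the threshold.

    min_dwell_ms is a placeholder value chosen for illustration only.
    """
    dwell = {}
    for fix in fixations:
        for aoi in aois:
            if aoi.contains(fix.x, fix.y):
                dwell[aoi.item_id] = dwell.get(aoi.item_id, 0.0) + fix.duration_ms
                break  # AOIs are assumed not to overlap
    return {item for item, total in dwell.items() if total >= min_dwell_ms}


# Made-up example: two adjacent items and four fixations.
aois = [AOI("a", 0, 0, 100, 100), AOI("b", 100, 0, 200, 100)]
fixations = [
    Fixation(10, 10, 60), Fixation(50, 50, 60),  # 120 ms on item "a"
    Fixation(150, 20, 80),                       # 80 ms on item "b"
    Fixation(500, 500, 300),                     # outside every AOI
]
print(examined_items(fixations, aois))
# {'a'}
```

Stating the threshold and AOI geometry explicitly matters because both choices change which items count as examined, and therefore the click-through rate conditioned on examination.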
Circularity Check
No circularity: empirical analysis of independent external dataset
The paper conducts a direct empirical analysis of a recently-released eye-tracking dataset to test transfer of web-search assumptions (F-pattern, examination hypothesis, heading attention) to carousel interfaces. No equations, fitted parameters, derivations, or self-citations appear in the provided text or abstract. Claims reduce solely to observed patterns in the external data rather than any self-referential construction, renaming, or load-bearing prior work by the authors. This is a standard observational study whose central findings are falsifiable against the dataset itself and carry no internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Eye-tracking data provides a valid and unbiased measure of user examination behavior in digital interfaces.