AllSERP: Exhaustive Per-Element Enrichment of the Versatile AdSERP Dataset

K. Andrew Edmonds

arxiv: 2605.04949 · v2 · pith:VZP4VYJKnew · submitted 2026-05-06 · 💻 cs.IR

Pith reviewed 2026-05-20 23:22 UTC · model grok-4.3

classification 💻 cs.IR

keywords AllSERP · AdSERP · SERP · bounding boxes · click attribution · eye tracking · information retrieval · user behavior analysis

The pith

AllSERP enriches the AdSERP corpus with pixel-accurate bounding boxes for organic and widget elements plus semantic types for thirteen categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases AllSERP as a typed area-of-interest and per-element behavioral enrichment of the existing AdSERP dataset of real Google search engine result pages. It supplies bounding boxes for non-ad elements derived from screenshots using computer vision, labels drawn from HTML parsing, and a gap-filling method that together support click attribution for 91.7 percent of the trials. The work maintains zero disagreement with the original ad rectangles on an ad-versus-non-ad split across more than 38,000 classifications. This enrichment is shipped with a reproducible pipeline, per-trial JSON files, a corpus CSV, and a browser replay viewer so that researchers can perform element-level analyses of clicks, fixations, and regressions. The release targets questions that the original ads-only bounding boxes could not address.

Core claim

AllSERP adds pixel-accurate organic and widget bounding boxes via screenshot-anchored computer vision, semantic types across thirteen element types via an HTML parser, an inter-result gap-fill flavor called typed_gapfill, and X+Y click attribution that reaches 91.7 percent of the corpus while flagging the rest at trial level, with the Phase C ad-versus-non-ad partition showing zero disagreements across 38,250 classifications.
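As a rough illustration of what X+Y click attribution over typed bounding boxes involves, the sketch below runs a point-in-rectangle test on both axes and treats a click that lands in no box as unattributed. The `Element` fields, the `(x, y, width, height)` layout, and the half-open edge convention are assumptions made for this example, not the schema of the released per-trial JSONs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    """One typed area of interest. Field names are hypothetical."""
    element_type: str  # e.g. "ad", "organic", "widget"
    x: float           # left edge in page pixels
    y: float           # top edge in page pixels
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # Half-open intervals, so two stacked boxes never both claim a shared edge.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def attribute_click(elements: Sequence[Element],
                    px: float, py: float) -> Optional[Element]:
    """Return the first element containing the click on both axes, else None."""
    for element in elements:
        if element.contains(px, py):
            return element
    return None


Trial = Tuple[Sequence[Element], Tuple[float, float]]


def attribution_rate(trials: Sequence[Trial]) -> float:
    """Share of trials whose click falls inside some element."""
    if not trials:
        return 0.0
    hits = sum(1 for elements, (px, py) in trials
               if attribute_click(elements, px, py) is not None)
    return hits / len(trials)
```

A trial for which `attribute_click` returns `None` corresponds to the kind of case the paper flags at trial level rather than forcing onto an element.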

What carries the argument

Screenshot-anchored computer vision pipeline for bounding-box generation combined with HTML parsing for semantic element typing and typed_gapfill for inter-result coverage.

If this is right

Enables per-element click, fixation, regression, and above-fold analyses that the original ads-versus-organic split could not resolve.
The ad-versus-non-ad partition remains internally consistent with the shipped ad rectangles across the entire corpus.
Provides a reproducible pipeline, per-trial JSONs, corpus CSV, and browser-based replay viewer for community use.
Supports finer-grained study of user behavior on search engine result pages before the introduction of AI Overviews.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be applied to post-AI-Overview SERPs to measure how new summary elements alter attention and click patterns.
Integration with additional eye-tracking or interaction datasets from other search engines would allow cross-platform comparisons of element-level engagement.
The typed element labels may support training or evaluation of models that predict user attention at the level of individual widgets or organic results.

Load-bearing premise

The computer vision procedure anchored to screenshots produces bounding boxes accurate enough to support reliable per-element behavioral analysis.

What would settle it

A manual audit or independent ground-truth comparison that finds systematic mismatches between the generated organic and widget bounding boxes and the actual rendered positions of those elements on the screenshots.
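A manual audit of this kind would typically score each generated box against a hand-drawn one. The sketch below shows two common measures, intersection over union (IoU) and the largest edge displacement in pixels. The `(x, y, width, height)` box layout is an assumption for illustration, and no ground-truth annotations are reported for AllSERP.

```python
from typing import Tuple

# Hypothetical box layout: (x, y, width, height) in page pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def max_edge_offset(a: Box, b: Box) -> float:
    """Largest absolute displacement among the four edges, in pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return max(abs(ax - bx),
               abs(ay - by),
               abs((ax + aw) - (bx + bw)),
               abs((ay + ah) - (by + bh)))
```

Reporting the distribution of both values per element type, rather than a single corpus mean, would show whether mismatches are systematic for particular widgets.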

Figures

Figures reproduced from arXiv: 2605.04949 by K. Andrew Edmonds.

**Figure 1.** The four-phase pipeline applied to a synthetic SERP. (A) CV row-projection on the main column produces card spans

**Figure 2.** Replay viewer rendering of trial p010-b2-t6 (viewer

**Figure 3.** Click rate by position under the two main flavors AllSERP releases. Left: organic-only flavor; position 0 is the topmost

Original abstract

We release AllSERP, a typed AOI and per-element behavioral enrichment of the AdSERP commercial-intent SERP corpus [4]. AdSERP ships 2,776 trials of full-page screenshots, captured SERP HTML, 150 Hz Gazepoint eye tracking, evtrack mouse telemetry, scroll, and pupil signals against real Google SERPs collected before AI Overviews -- but its bounding boxes cover only ad surfaces (15.5 % of attributable clicks). AllSERP adds pixel-accurate organic and widget bboxes via screenshot-anchored CV, semantic types across thirteen element types via an HTML parser, an inter-result gap-fill flavor (typed_gapfill), and X+Y click attribution that reaches 91.7 % of the corpus while flagging the rest at trial level. The Phase C ad-vs-non-ad partition is internally consistent with the shipped ad rectangles (0 disagreements across 38,250 classifications). We ship the pipeline, per-trial JSONs, a corpus CSV, and a browser-based replay viewer; everything is reproducible from the AdSERP Zenodo volume. The release enables per-element click, fixation, regression, and above-fold analyses that the shipped ads-vs-organic split could not resolve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript releases AllSERP, a typed AOI and per-element behavioral enrichment of the AdSERP corpus of 2,776 real Google SERP trials. It augments the original ad-only bounding boxes (covering 15.5% of clicks) by adding pixel-accurate organic and widget bounding boxes via screenshot-anchored computer vision, semantic types across thirteen element types via an HTML parser, a typed_gapfill inter-result method, and X+Y click attribution reaching 91.7% of the corpus (with the remainder flagged at trial level). The work reports an internal consistency check for the Phase C ad-vs-non-ad partition (zero disagreements across 38,250 classifications) and ships the full pipeline, per-trial JSONs, corpus CSV, and browser replay viewer, all reproducible from the AdSERP Zenodo volume.

Significance. If the CV-derived bounding boxes are sufficiently accurate, the release would enable valuable per-element analyses of clicks, fixations, regressions, and above-fold behavior on SERPs that the prior ads-vs-organic split could not support. The shipped artifacts, reproducibility, high attribution coverage, and direct consistency check against existing ad rectangles are clear strengths for a data-release contribution in information retrieval.

major comments (2)

[Abstract] Abstract: The central claim that AllSERP supplies 'pixel-accurate organic and widget bboxes via screenshot-anchored CV' (plus semantic types for thirteen element types) is load-bearing for all downstream per-element analyses, yet the only quantitative check described is the ad-vs-non-ad partition consistency (0 disagreements on 38,250 classifications). No IoU, pixel-offset, precision/recall, or ground-truth comparison is reported for the new CV outputs or the organic/widget boxes.
[Abstract] Abstract (Phase C description): The ad-vs-non-ad consistency check relies on the pre-existing shipped ad rectangles and therefore provides no validation for the accuracy of the newly derived CV bounding boxes, the typed_gapfill method, or the HTML-parser semantic types.

minor comments (2)

[Abstract] The abstract would be strengthened by briefly stating any available error rates or validation approach for the CV component even if full details appear later in the text.
Ensure the reference to the original AdSERP work [4] is fully expanded in the bibliography and that any new dependencies (e.g., specific CV libraries) are cited.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful and constructive review of the AllSERP manuscript. We address each major comment point by point below, clarifying the scope of our reported checks while acknowledging limitations in direct validation for the new components. Revisions have been made to improve transparency.

Referee: [Abstract] Abstract: The central claim that AllSERP supplies 'pixel-accurate organic and widget bboxes via screenshot-anchored CV' (plus semantic types for thirteen element types) is load-bearing for all downstream per-element analyses, yet the only quantitative check described is the ad-vs-non-ad partition consistency (0 disagreements on 38,250 classifications). No IoU, pixel-offset, precision/recall, or ground-truth comparison is reported for the new CV outputs or the organic/widget boxes.

Authors: We appreciate the referee drawing attention to the validation gap. The reported consistency check (zero disagreements on 38,250 classifications) confirms that our Phase C ad/non-ad partition aligns with the pre-existing AdSERP ad rectangles, serving as an internal sanity check for the overall enrichment pipeline. However, we acknowledge that this does not provide direct quantitative accuracy metrics such as IoU, pixel offset, precision, or recall for the newly derived CV bounding boxes on organic and widget elements, nor for the semantic types. The CV method is screenshot-anchored and the semantic types are extracted via deterministic HTML parsing, but no independent ground-truth annotations were collected for these additions. We have revised the abstract to qualify the bounding-box claim as 'high-resolution' rather than 'pixel-accurate' and added a dedicated limitations subsection describing the CV and parsing approaches along with the absence of external validation metrics. The complete open-source pipeline is provided to support future community-led accuracy assessments.

revision: yes

Referee: [Abstract] Abstract (Phase C description): The ad-vs-non-ad consistency check relies on the pre-existing shipped ad rectangles and therefore provides no validation for the accuracy of the newly derived CV bounding boxes, the typed_gapfill method, or the HTML-parser semantic types.

Authors: We agree that the consistency check is scoped specifically to the ad/non-ad partition against the original rectangles and therefore does not constitute validation of the CV-derived organic/widget boxes, the typed_gapfill interpolation, or the HTML-derived semantic labels. The typed_gapfill method is a rule-based procedure for filling inter-result gaps, and semantic types are produced by parsing the captured SERP HTML. We have updated the manuscript text to explicitly delimit the scope of the consistency check, expanded the methods section with pseudocode for typed_gapfill and the HTML parser, and inserted a limitations paragraph noting that accuracy of the new elements rests on the reproducibility of the released pipeline rather than on new ground-truth comparisons. We maintain that the 91.7% click attribution rate and the release of all per-trial JSONs, CSV, and viewer still provide substantial value for per-element analyses.

revision: partial

standing simulated objections not resolved

We do not possess independent ground-truth annotations for the organic and widget bounding boxes, so cannot supply IoU, precision/recall, or pixel-offset metrics without new data collection.

Circularity Check

0 steps flagged

No circularity: data release with direct consistency check only

The manuscript is a dataset release paper that describes adding bounding boxes via screenshot-anchored CV, semantic labels via HTML parser, and typed gap-fill to the existing AdSERP corpus. The sole quantitative statement is a direct consistency check (0 disagreements on 38,250 ad-vs-non-ad classifications) that compares the new partition against the pre-existing ad rectangles shipped with the original dataset. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. The cited prior work [4] is an external corpus release, not a self-citation chain supporting a uniqueness claim or ansatz. The contribution is therefore self-contained as reproducible artifacts and a simple internal consistency verification.
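To make the scope of that consistency check concrete, the sketch below counts disagreements between an ad/non-ad label and a set of shipped ad rectangles. The rule used here (an element counts as inside an ad if its center falls in any ad rectangle) and the `(x, y, width, height)` layout are assumptions for illustration; this page does not describe the paper's actual Phase C rule.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical rectangle layout: (x, y, width, height) in page pixels.
Rect = Tuple[float, float, float, float]


def center_in_any(box: Rect, rects: Sequence[Rect]) -> bool:
    """True if the center of `box` lies inside any rectangle in `rects`."""
    cx = box[0] + box[2] / 2
    cy = box[1] + box[3] / 2
    return any(rx <= cx < rx + rw and ry <= cy < ry + rh
               for rx, ry, rw, rh in rects)


def count_disagreements(classified: Iterable[Tuple[Rect, bool]],
                        ad_rects: Sequence[Rect]) -> int:
    """Count elements whose is_ad label conflicts with the ad rectangles."""
    return sum(1 for box, is_ad in classified
               if is_ad != center_in_any(box, ad_rects))
```

Under this assumed rule, an organic box displaced by tens of pixels still yields zero disagreements as long as it stays outside the ad rectangles, which is why a check of this shape says nothing about the geometry of the new boxes.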

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper. No free parameters are fitted, no mathematical axioms are invoked, and no new entities are postulated beyond standard CV and HTML parsing techniques.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "AllSERP adds pixel-accurate organic and widget bboxes via screenshot-anchored CV, semantic types across thirteen element types via an HTML parser, an inter-result gap-fill flavor (typed_gapfill), and X+Y click attribution"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] Andrew T. Duchowski. 2026. Real-Time Cognitive Load Measurement of Pupillary Oscillation. Proceedings of the ACM on Computer Graphics and Interactive Techniques 9, 2 (2026). doi:10.1145/3803537

[2] Wai-Tat Fu and Peter Pirolli. 2007. SNIF-ACT: A Cognitive Model of User Navigation on the World Wide Web. Human-Computer Interaction 22, 4 (2007), 355–412. doi:10.1080/07370020701638806

[3] Yasith Jayawardena, Gavindya Jayawardana, and Jacek Gwizdka. 2025. Real-Time Pupillometry-based Index of Pupillary Activity (RIPA2). Journal of Eye Movement Research (2025)

[4] Kayhan Latifzadeh, Jacek Gwizdka, and Luis A. Leiva. 2025. A Versatile Dataset of Mouse and Eye Movements on Search Engine Results Pages. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). ACM, Padua, Italy, 3412–3421. doi:10.1145/3726302.3730325 Dataset on Zenodo: https://zenodo...

[5] Luis A. Leiva and Roberto Vivó. 2013. Web Browsing Behavior Analysis and Interactive Hypervideo. ACM Transactions on the Web 7, 4 (2013), 1–28. doi:10.1145/2529995.2529996

[6] Peter Pirolli and Stuart Card. 1999. Information Foraging. Psychological Review 106, 4 (1999), 643–675. doi:10.1037/0033-295X.106.4.643

[7] Mario Villaizán-Vallelado, Matteo Salvatori, Kayhan Latifzadeh, Antonio Penta, Luis A. Leiva, and Ioannis Arapakis. 2025. AdSight: Scalable and Accurate Quantification of User Attention in Multi-Slot Sponsored Search. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). ACM, Padua... doi:10.1145/3726302.3729891
