Estimating Absolute Web Crawl Coverage From Longitudinal Set Intersections

Fabian Baumann; Grigori Paris; Michael Paris

arXiv: 2603.15416 · v2 · submitted 2026-03-16 · physics.soc-ph · cs.DL · cs.IR · cs.IT · math.IT

Pith reviewed 2026-05-15 10:20 UTC · model grok-4.3

keywords: web archives · crawl coverage · longitudinal analysis · urn model · overlap estimation · German Academic Web · focused crawling

The pith

Absolute web-crawl coverage can be estimated from URL overlaps between successive crawls alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to quantify how complete a web crawl is by using only the archive's own repeated crawls over time. The approach models the chance that any given URL appears in two consecutive crawls as the result of a simple urn process in which the total number of possible URLs is fixed. Linear regression on the observed overlap sizes then yields the two unknown parameters: the total size of the crawlable space and the coverage fraction. When applied to fifteen semi-annual crawls of the German academic web collected between 2013 and 2021, the model returns a stable coverage of roughly 46 percent of the crawlable URL space.

Core claim

Coverage of a focused web crawl equals the fraction of an urn's balls that are drawn in each trial; this fraction and the urn size are recovered by regressing the sizes of successive intersections against the sizes of the individual crawls.
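
The arithmetic behind this claim can be written out under the simplest reading of the urn model. This reading is an assumption, not the paper's stated derivation: two crawls each draw $M$ of $N$ URLs uniformly at random and independently of each other. The symbols $M$ and $N$ follow the overlap formula quoted further down this page; $p$ is introduced here only for illustration.

```latex
\begin{align}
  \Pr[\text{URL} \in u_1 \cap u_2] &= \frac{M}{N}\cdot\frac{M}{N}, \\
  \mathbb{E}\,|u_1 \cap u_2| &= N \cdot \frac{M^2}{N^2} = \frac{M^2}{N}, \\
  p = \frac{M}{N} &= \frac{\mathbb{E}\,|u_1 \cap u_2|}{M}.
\end{align}
```

Under this reading, and before any correction for URL turnover between crawls, the coverage fraction equals the share of one crawl's URLs that reappear in the next.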

What carries the argument

Urn process model for longitudinal URL overlaps, with coverage and space size inferred by linear regression on intersection counts.

If this is right

Coverage estimates require no external reference data.
The method applies to any sequence of focused crawls that maintain a stable configuration.
For the observed German academic web crawls the estimated coverage is approximately 46 percent.
Coverage can be tracked over time as new crawls are added.
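
One way to make the estimator behind these points concrete is sketched below. It is not the authors' code. It assumes each crawl is an independent uniform draw from a fixed space of N URLs, so that the expected overlap of two crawls A and B is |A|·|B|/N, and it fits a least-squares line through the origin to the consecutive overlaps. The function name `estimate_coverage`, the example.org URLs, and the 20,000-URL universe are all hypothetical.

```python
import random
from typing import Sequence, Set, Tuple


def estimate_coverage(crawls: Sequence[Set[str]]) -> Tuple[float, float]:
    """Estimate (N, coverage) from consecutive crawl overlaps.

    Assumption: each crawl is an independent uniform draw from a fixed space
    of N URLs, so E[|A & B|] = |A| * |B| / N. The slope of a least-squares
    line through the origin, overlap ~ slope * (|A| * |B|), estimates 1 / N.
    """
    if len(crawls) < 2:
        raise ValueError("need at least two crawls")
    pairs = list(zip(crawls, crawls[1:]))
    # x: product of the two crawl sizes; y: size of their intersection
    xs = [len(a) * len(b) for a, b in pairs]
    ys = [len(a & b) for a, b in pairs]
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if sxy == 0:
        raise ValueError("no overlap between consecutive crawls")
    n_hat = sxx / sxy  # reciprocal of the fitted slope
    mean_size = sum(len(c) for c in crawls) / len(crawls)
    return n_hat, mean_size / n_hat


if __name__ == "__main__":
    rng = random.Random(0)
    universe = [f"https://example.org/page/{i}" for i in range(20_000)]
    # Hypothetical ground truth: every crawl sees 46% of a fixed URL space.
    crawls = [set(rng.sample(universe, 9_200)) for _ in range(15)]
    n_hat, coverage = estimate_coverage(crawls)
    print(f"estimated N = {n_hat:.0f}, coverage = {coverage:.3f}")
```

Only set sizes and intersections enter the fit, so appending a new crawl to the list and re-running the function is enough to update the estimate.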

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same urn-model logic could be used to compare coverage across different crawling policies without ground truth.
If the model parameters change between crawl regimes, the method could detect when a crawler has begun to miss new parts of the web.
Extending the regression to include higher-order overlaps between non-consecutive crawls might tighten the estimates.

Load-bearing premise

The observed URL overlaps between consecutive crawls are generated by a fixed-size urn process whose parameters remain constant across the observation period.

What would settle it

An exhaustive crawl of the same German academic web domain would settle it: if the resulting URL count differs substantially from the model's estimate of the total crawlable space, the claim is falsified.

Original abstract

Web archives preserve portions of the web, but quantifying their completeness remains challenging. Prior approaches have estimated the coverage of a crawl by either comparing the outcomes of multiple crawlers, or by comparing the results of a single crawl to external ground truth datasets. We propose a method to estimate the absolute coverage of a crawl using only the archive's own longitudinal data, i.e., the data collected by multiple subsequent crawls. Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process. The parameters of the urn model can then be inferred from longitudinal crawl data using linear regression. Applied to our focused crawl configuration of the German Academic Web, with 15 semi-annual crawls between 2013-2021, we find a coverage of approximately 46 percent of the crawlable URL space for the stable crawl configuration regime. Our method is extremely simple, requires no external ground truth, and generalizes to any longitudinal focused crawl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a simple urn-regression method to back out absolute coverage from longitudinal crawl overlaps alone, but the random-sampling assumption looks mismatched to how focused crawlers actually work.

The paper's main contribution is a method for estimating the absolute coverage of a web crawl using only the overlaps between successive crawls in a longitudinal series. The authors model those overlaps as the result of an urn process and recover the coverage fraction by linear regression on the intersection sizes. On their 15 semi-annual crawls of the German academic web they report about 46 percent coverage for the stable configuration.

The approach is new in that it avoids both multi-crawler comparisons and external ground truth. It is simple enough that anyone with repeated crawls of the same target could apply it with basic statistics. The authors are clear that it applies to focused crawls under stable policies, which keeps the claim scoped.

The main concern is the match between the urn model and real crawl behavior. A focused crawler discovers pages through links rather than uniform random selection, so the probability that a URL appears in consecutive crawls is not independent across URLs. This can produce different overlap patterns than the urn predicts. The paper does not appear to show residual plots or tests against the urn null, nor does it check whether the same coverage estimate holds when the crawl policy changes. There is also the question of page birth and death over the six-month intervals, since the model treats the URL space as fixed.

For readers who maintain focused archives, this could be a useful internal diagnostic if the fit can be validated on their data. It is not yet ready to replace ground-truth checks, but it is worth testing. I would send it to peer review because the idea is practical and the limitations are straightforward to investigate.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes a method to estimate absolute web crawl coverage using only longitudinal data from multiple subsequent crawls. It models empirical URL overlaps between crawls as a simple urn process, infers the urn parameters via linear regression on the overlap counts, and applies the approach to 15 semi-annual crawls of the German Academic Web (2013-2021) to report approximately 46% coverage of the crawlable URL space in the stable configuration regime. The method is presented as simple, ground-truth-free, and generalizable to any longitudinal focused crawl.

Significance. If the urn model and regression recovery are shown to be reliable, the technique would offer a practical, self-contained way to quantify archive completeness from internal data alone. This could be valuable for web preservation studies, enabling coverage assessment without external comparators or ground-truth datasets.

major comments (4)

[Abstract] Abstract: the claim that overlaps are 'well described by a simple urn process' and that regression recovers coverage supplies no fit statistics, residual analysis, sensitivity checks, or validation against known coverage; the 46% figure is presented without supporting quantitative evidence.
[Method] Method (urn model and regression): coverage is recovered by fitting urn parameters directly to the same longitudinal overlap counts later used to compute coverage, so the estimate is defined in terms of quantities fitted to the target data; this circularity requires explicit testing via simulation or external validation.
[Results] Results (German Academic Web application): the single scalar coverage for the 'stable crawl configuration regime' is reported without checks that regression residuals are consistent with the urn null or that the same coverage is recovered under deliberate policy variations.
[Discussion] Discussion (model assumptions): the urn process assumes random sampling of URLs, yet focused crawls traverse link graphs with depth limits, domain prioritization, and politeness rules; no analysis tests whether these correlations (e.g., entire subtrees discovered together) or net page birth/death over six-month intervals bias the inferred coverage parameter.
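
The bias raised in the last comment can be illustrated with expectations alone. The sketch below is an illustration under stated assumptions, not an analysis from the paper: each URL i is included in a crawl independently with its own probability p_i, so that for two crawls E[size] = Σ p_i and E[overlap] = Σ p_i². The naive urn estimator N̂ = size² / overlap then never exceeds the true N (Cauchy–Schwarz), which pushes the estimated coverage upward whenever the p_i differ. The function name `naive_estimates` and the two-class "easy/hard" population are hypothetical.

```python
def naive_estimates(probs):
    """Return (true_coverage, estimated_coverage) for per-URL inclusion probabilities.

    Works with expectations only: E[size] = sum(p) and E[overlap] = sum(p^2)
    for two independent crawls. The naive urn estimator of the URL space is
    N_hat = size^2 / overlap, and the estimated coverage is size / N_hat.
    """
    n = len(probs)
    exp_size = sum(probs)
    exp_overlap = sum(p * p for p in probs)
    n_hat = exp_size ** 2 / exp_overlap
    return exp_size / n, exp_size / n_hat


# Uniform case: every URL is crawled with probability 0.46.
uniform = [0.46] * 10_000
# Hypothetical skewed case with the same mean: half the URLs are "easy"
# (p = 0.82, e.g. close to the seeds), half are "hard" (p = 0.10).
skewed = [0.82] * 5_000 + [0.10] * 5_000

print(naive_estimates(uniform))  # true and estimated coverage agree
print(naive_estimates(skewed))   # estimated coverage exceeds the true value
```

In the skewed case the true coverage is still 0.46, but the estimate rises to roughly 0.74, so heterogeneous discovery probabilities alone can inflate the reported figure.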

minor comments (2)

[Abstract] Clarify the precise definition of 'crawlable URL space' and how it is operationalized in the urn model.
[Figures] Add axis labels, legends, and error bars to any regression or overlap plots for readability.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, clarifying our approach where possible and indicating revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: the claim that overlaps are 'well described by a simple urn process' and that regression recovers coverage supplies no fit statistics, residual analysis, sensitivity checks, or validation against known coverage; the 46% figure is presented without supporting quantitative evidence.

Authors: We agree that quantitative support for the model fit strengthens the claims. In the revised manuscript we will add R-squared values and residual diagnostics for the linear regressions, plus sensitivity analyses on parameter stability. We will also include a simulation study that generates synthetic longitudinal overlaps under known coverage and recovers the input parameters via the same regression procedure.
Revision: yes

Referee: [Method] Method (urn model and regression): coverage is recovered by fitting urn parameters directly to the same longitudinal overlap counts later used to compute coverage, so the estimate is defined in terms of quantities fitted to the target data; this circularity requires explicit testing via simulation or external validation.

Authors: The regression estimates urn parameters (including the unseen total) from observed overlaps; the coverage figure is then a derived quantity. This is standard statistical inference rather than circularity, but we accept the need for explicit validation. The revision will add Monte-Carlo simulations that inject known coverage values, generate overlap counts, and confirm accurate recovery of the injected parameters.
Revision: yes

Referee: [Results] Results (German Academic Web application): the single scalar coverage for the 'stable crawl configuration regime' is reported without checks that regression residuals are consistent with the urn null or that the same coverage is recovered under deliberate policy variations.

Authors: We will incorporate residual plots and formal checks against the urn-model null in the results section. Deliberate policy variations are not feasible with the existing historical dataset, which was collected under a single fixed crawl policy; however, we will report coverage estimates across multiple sliding time windows to demonstrate stability within the observed data.
Revision: partial

Referee: [Discussion] Discussion (model assumptions): the urn process assumes random sampling of URLs, yet focused crawls traverse link graphs with depth limits, domain prioritization, and politeness rules; no analysis tests whether these correlations (e.g., entire subtrees discovered together) or net page birth/death over six-month intervals bias the inferred coverage parameter.

Authors: The urn model is indeed an independence approximation. The revised discussion will explicitly acknowledge potential biases arising from link-graph correlations, prioritization rules, and net URL turnover between semi-annual crawls. We will frame the 46% figure as an estimate conditional on the model assumptions and outline how richer crawl metadata could allow future extensions that relax independence.
Revision: yes
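
The kind of Monte-Carlo recovery check promised in these responses could look roughly like the sketch below. It is a minimal illustration, not the authors' planned study: it assumes a fixed urn of `n_urls` URLs, crawls of equal size drawn uniformly without replacement, and the simple relation coverage = E[overlap] / M that holds under that null. All function names and default values are hypothetical.

```python
import random
import statistics


def simulate_overlap_estimate(n_urls, coverage, n_crawls, rng):
    """One synthetic run: draw crawls from a fixed urn, return estimated coverage."""
    m = round(coverage * n_urls)
    crawls = [set(rng.sample(range(n_urls), m)) for _ in range(n_crawls)]
    overlaps = [len(a & b) for a, b in zip(crawls, crawls[1:])]
    # Under the urn null E[overlap] = m^2 / N, so coverage m / N = E[overlap] / m.
    return statistics.mean(overlaps) / m


def recovery_study(n_urls=5_000, coverage=0.46, n_crawls=15, n_runs=200, seed=0):
    """Repeat the synthetic experiment; return mean and spread of the estimates."""
    rng = random.Random(seed)
    estimates = [
        simulate_overlap_estimate(n_urls, coverage, n_crawls, rng)
        for _ in range(n_runs)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    mean_est, sd_est = recovery_study()
    print(f"injected 0.46, recovered {mean_est:.4f} +/- {sd_est:.4f}")
```

A check of this form only confirms that the estimator is consistent when the urn assumptions hold; it says nothing about robustness to link-graph correlations or URL turnover, which would need to be injected into the simulation separately.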

Circularity Check

1 step flagged

Coverage recovered by fitting urn parameters directly to the same longitudinal overlap counts

specific steps

fitted input called prediction [Abstract]
"Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process. The parameters of the urn model can then be inferred from longitudinal crawl data using linear regression. Applied to our focused crawl configuration of the German Academic Web, with 15 semi-annual crawls between 2013-2021, we find a coverage of approximately 46 percent of the crawlable URL space for the stable crawl configuration regime."

The coverage fraction is a parameter of the urn model; its value is recovered by regressing the model on the exact overlap counts later used to report the coverage. The 46% figure is therefore the fitted parameter rather than a prediction generated from independent inputs.

full rationale

The paper's central estimate is obtained by inferring the urn model's coverage parameter via linear regression on the observed intersection sizes from the 15 crawls. This makes the reported 46% a fitted value from the target data under the assumed functional form rather than an independent derivation. The derivation chain therefore reduces to a fit on the inputs used to compute the result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one domain assumption (overlaps follow a simple urn process) whose parameters are fitted to the observed data; no new physical entities are introduced.

free parameters (1)

urn process parameters
Parameters of the urn model are inferred via linear regression on the empirical overlaps from the 15 crawls.

axioms (1)

domain assumption: URL overlaps between subsequent crawls are well described by a simple urn process
Stated as the key insight enabling inference of absolute coverage from longitudinal data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our key insight is that coverage can be estimated from the empirical URL overlaps between subsequent crawls, which are in turn well described by a simple urn process... $f(T) = c\,\alpha^{T-1}$"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the expected size of the overlap... $\mathbb{E}[|u_1 \cap u_T|] = \frac{M^2}{N}\,\alpha^{T-1}$"
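
The two quoted expressions suggest how a linear regression can recover the parameters: taking logarithms of E[|u_1 ∩ u_T|] = (M²/N)·α^{T-1} gives a straight line in T − 1 with intercept log(M²/N) and slope log α. The sketch below fits that line by ordinary least squares. It rests on assumptions the excerpt does not confirm: that M is a known, common crawl size, that c = M²/N, and that α acts as a per-interval factor. The function name and the synthetic values are hypothetical.

```python
import math


def fit_overlap_decay(overlaps, crawl_size):
    """Fit log(overlap_T) = log(c) + (T - 1) * log(alpha) by least squares.

    overlaps[k] is |u_1 & u_T| for T = k + 2, i.e. the first entry pairs
    crawl 1 with crawl 2. Returns (alpha, N, coverage), assuming c = M^2 / N.
    """
    if len(overlaps) < 2:
        raise ValueError("need at least two overlap values")
    xs = [k + 1 for k in range(len(overlaps))]  # T - 1 = 1, 2, 3, ...
    ys = [math.log(v) for v in overlaps]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    c = math.exp(intercept)
    n_hat = crawl_size ** 2 / c
    return math.exp(slope), n_hat, crawl_size / n_hat


# Noise-free synthetic check with made-up values M = 4600, N = 10000, alpha = 0.9.
M, N, ALPHA = 4_600, 10_000, 0.9
synthetic = [M ** 2 / N * ALPHA ** t for t in range(1, 8)]
alpha_hat, n_hat, coverage = fit_overlap_decay(synthetic, M)
print(alpha_hat, n_hat, coverage)
```

On noise-free input the fit returns the injected values up to floating-point error; with real overlap counts the residuals of this line are what a goodness-of-fit check against the urn model would examine.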

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.