arxiv: 2604.07585 · v1 · submitted 2026-04-08 · 💻 cs.IR · cs.AI

Don't Measure Once: Measuring Visibility in AI Search (GEO)

Julius Schulte, Malte Bleeker, Philipp Kaufmann

Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3


keywords: generative engine optimization · AI search visibility · probabilistic measurement · repeated sampling · information retrieval · GEO · output variability


The pith

Visibility in AI search must be assessed through repeated measurements because single queries produce unreliable snapshots due to output variability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that generative engine optimization requires treating visibility as a distribution obtained from multiple queries rather than a single observation. Traditional search engines deliver stable results where one query suffices, but AI chat systems vary their answers across runs, prompts, and time. This probabilistic behavior makes one-off checks inadequate for determining where a brand or page ranks relative to competitors. The authors support this by referencing empirical studies that show inconsistent outputs, leading to the conclusion that GEO performance evaluation needs repeated sampling. If correct, this changes how practitioners and researchers measure and optimize content for AI-driven information access.

Core claim

The inherent probabilistic nature of large language model-based chat systems changes the paradigm of visibility measurement. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, visibility in generative engine optimization must be characterized as a distribution rather than a single-point outcome to accurately assess a brand's performance.

What carries the argument

Modeling visibility as a distribution across repeated queries to capture the probabilistic variation in AI search outputs.

If this is right

GEO assessments must incorporate multiple samples to produce reliable rankings and visibility scores.
Metrics for AI search should include statistical measures such as variance or probability ranges instead of fixed positions.
Optimization efforts need to account for inconsistency in model responses rather than targeting a single expected output.
Evaluation protocols in information retrieval for generative systems require new repeated-measurement standards.
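
One way to read these consequences in practice is to treat each repeated run as a yes/no trial for whether a brand appears, then report the per-brand detection rate together with its standard error instead of a single observation. The sketch below is a minimal illustration under that assumption; the function name `detection_rate`, the brand names, and the run data are hypothetical and are not taken from the paper.

```python
import math


def detection_rate(runs, brand):
    """Return (rate, standard_error) for how often `brand` appears.

    `runs` is a list of sets, one set of mentioned brands per repeated query.
    The standard error uses the simple binomial formula sqrt(p * (1 - p) / n),
    which assumes the runs are independent (an assumption of this sketch).
    """
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    hits = sum(1 for brands in runs if brand in brands)
    p = hits / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se


# Hypothetical data: four repeated runs of the same prompt.
runs = [
    {"Acme", "Globex"},
    {"Acme"},
    {"Globex", "Initech"},
    {"Acme", "Initech"},
]

rate, se = detection_rate(runs, "Acme")
print(f"Acme detection rate: {rate:.2f} (SE {se:.2f})")
```

With only four runs the standard error is large relative to the rate, which is the practical point: a single run would report either 0 or 1 for this brand, and neither value describes the underlying distribution.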

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Standard SEO reporting tools may need to evolve into dashboards that display visibility distributions with confidence intervals.
The approach could extend to other stochastic retrieval systems where output consistency is low.
Businesses optimizing for AI search might adopt ongoing monitoring schedules instead of periodic single checks.

Load-bearing premise

The variability seen in AI search outputs is large enough and consistent enough that single-point measurements become practically unusable for assessing GEO performance.

What would settle it

An experiment that repeatedly queries the same set of prompts and finds negligible variation in visibility outcomes across runs would show that single measurements are sufficient.
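
The experiment described here could be scored with a set-overlap measure such as Jaccard similarity, which the paper's figures also report. The following is a rough sketch only: it assumes each run has already been reduced to a set of brands, and the helper names `jaccard` and `mean_pairwise_jaccard` as well as the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: three repeated runs of the same prompt.
runs = [{"A", "B"}, {"A", "B"}, {"A", "C"}]
print(round(mean_pairwise_jaccard(runs), 3))  # 0.556
```

A value close to 1.0 across many prompts would support the view that single measurements suffice; values well below 1.0 would support repeated sampling. Where exactly to draw that line is a judgment call that this sketch does not settle.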

Figures

Figures reproduced from arXiv: 2604.07585 by Julius Schulte, Malte Bleeker, Philipp Kaufmann.

**Figure 2.** Day-to-day brand similarity (Jaccard, left; RBO, right) for the three campaigns meeting the ...

**Figure 3.** Source citation Gini coefficient by campaign and engine, Jan 24 – Mar 20, 2026. Values close to 1.0 indicate ...

**Figure 4.** Pairwise Jaccard and RBO for sources (top row) and brands (bottom row) across repeated runs within a ...

**Figure 5.** Per-prompt mean Jaccard (dark) and RBO (light) for source similarity (top row, all campaigns) and brand ...

**Figure 6.** Subsampling standard error of the estimated per-brand detection rate (left) and source-coverage Jaccard (right) ...

**Figure 7.** Mean standard error of the d-day rolling window per-brand detection rate (averaged across 1,726 per-brand series) as a function of window length d. Left: linear scale; right: log scale. The red dashed line marks SE = 0.05; the purple dashed line marks SE = 0.02. SE falls below 0.05 at d = 24 days and below 0.02 at d = 34 days.

Original abstract

As large language model-based chat systems become increasingly widely used, generative engine optimization (GEO) has emerged as an important problem for information access and retrieval. In classical search engines, results are comparatively transparent and stable: a single query often provides a representative snapshot of where a page or brand appears relative to competitors. The inherent probabilistic nature of AI search changes this paradigm. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, our findings underscore the need for repeated measurements to assess a brand's GEO performance and to characterize visibility as a distribution rather than a single-point outcome.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a genuine practical problem with single-shot GEO measurements due to LLM output variability, but it offers no numbers on how large or consistent that variability actually is.


The main thing to know is that single queries are unreliable for assessing brand visibility in AI chat systems because outputs shift across runs, prompts, and time, so the field should move to distributional measurement instead. This is a straightforward application of established LLM stochasticity to the GEO setting, and the contrast with stable classical search results is useful for anyone starting to evaluate these systems.

The paper does a decent job of stating the implication clearly: visibility should be characterized as a distribution rather than a point estimate, and it nods to empirical studies as support. That reminder alone could help practitioners avoid over-interpreting one-off results.

The soft spot is the lack of any quantification. The abstract invokes empirical studies but gives no sample sizes, no run-to-run standard deviations on citation rates or positions, no comparison against classical search variance, and no threshold for when single measurements become unusable. Without those details the claim stays at the level of a reasonable caution rather than a demonstrated necessity. If the full paper includes the actual data and methods behind the referenced studies, that would close the gap; as described, the argument rests on an untested assertion about magnitude.

This is for people working on GEO evaluation or LLM-based retrieval metrics who need a quick prompt to think about repeated sampling. A reader already familiar with LLM variability will find little new, but newcomers to measurement in this area could pick up the basic point. It deserves peer review once the quantitative backing is added, because the underlying issue is real and timely even if the current version is light on evidence.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that, unlike classical search engines, where a single query provides a stable snapshot of visibility, AI search (the setting of generative engine optimization, or GEO) is inherently probabilistic. LLM outputs vary across runs, prompts, and time, rendering one-off measurements unreliable; visibility must instead be assessed as a distribution via repeated measurements. The argument draws on unspecified empirical studies.

Significance. If substantiated with quantitative evidence that variability is large and consistent enough to make single measurements unusable in practice, the result would meaningfully shift evaluation practices in information retrieval and digital marketing, requiring statistical rather than deterministic approaches to GEO assessment.

major comments (1)

[Abstract] The central claim that 'one-off observations [are] unreliable' rests entirely on a reference to 'empirical studies' with no accompanying methods, sample sizes, quantitative results (e.g., run-to-run standard deviation of citation frequency, position, or sentiment scores), comparison to classical search variance, or decision threshold for when single measurements fail. This is load-bearing because the paper's thesis is that the observed variability is practically large enough to necessitate repeated measurements; without these data the assertion remains untested.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive critique. The feedback correctly identifies a gap in substantiation that we will address through revision.


Referee: [Abstract] The central claim that 'one-off observations [are] unreliable' rests entirely on a reference to 'empirical studies' with no accompanying methods, sample sizes, quantitative results (e.g., run-to-run standard deviation of citation frequency, position, or sentiment scores), comparison to classical search variance, or decision threshold for when single measurements fail. This is load-bearing because the paper's thesis is that the observed variability is practically large enough to necessitate repeated measurements; without these data the assertion remains untested.

Authors: We agree that the abstract, as written, does not supply the quantitative details needed to make the central claim self-contained and testable. The manuscript references empirical observations of output variability but does not report methods, sample sizes, or specific statistics in the abstract or with sufficient explicitness elsewhere. In the revised manuscript we will (1) expand the abstract to include concise quantitative summaries (e.g., number of queries, runs per query, observed standard deviations in citation frequency and position), (2) add a dedicated methods/results subsection that reports the experimental protocol, variance measurements, a direct comparison to the stability of classical search rankings, and a discussion of practical thresholds at which single measurements become unreliable, and (3) ensure all claims are tied to these data rather than left as an unspecified reference. These changes will be made in the next version.

Revision: yes

Circularity Check

0 steps flagged

No circularity: argument rests on cited empirical observations without self-referential reduction


The paper advances a conceptual claim that probabilistic variation in LLM outputs renders single GEO visibility measurements unreliable, supported by reference to external empirical studies. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central assertion does not reduce to a definition, fit, or prior author result by construction; it functions as an interpretive recommendation grounded in observed variability rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no mathematical model, free parameters, axioms, or new entities; the claim is a methodological recommendation grounded in referenced but undescribed empirical work.



Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

ACM Trans

URL https://papers.ssrn.com/abstract=5393256. F. Rejón-Guardia, S. Molinillo, and R. Anaya-Sánchez. Generative engine optimization: How search engines integrate AI-generated content into conventional queries. In Encyclopedia of Artificial Intelligence in Marketing, pages 1–8. Springer, 2025. E. Silliman, J. Boudet, and K. Robinson. Winning in the age of AI ...

work page doi:10.1145/1852102.1852106 2025
[2]

Assign ranks $i = [1, 2, 3, 4, 5]$

work page
[3]

Compute weighted sums: $\sum_i y_i = 1 + 2 + 3 + 4 + 10 = 20$ and $\sum_i i \cdot y_i = 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 + 4 \cdot 4 + 5 \cdot 10 = 80$

work page
[4]

any brand

Apply Equation (5): $G = \frac{2 \times 80}{5 \times 20} - \frac{6}{5} = 1.6 - 1.2 = 0.4$. $G = 0.4$ indicates moderate inequality: the dominant domain (10 citations) accounts for 50% of all citations while the four others share the remaining 50% unevenly. J Convergence Analysis: How Many Runs Are Sufficient? J.1 Motivation and Method. The stochastic nature of LLM outputs means that a single qu...
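
The Gini arithmetic in the extracted steps above can be checked with a short script. Equation (5) itself is not reproduced on this page, so the formula used here, $G = \frac{2 \sum_i i \, y_i}{n \sum_i y_i} - \frac{n + 1}{n}$ with counts sorted in ascending order, is an assumption inferred from the numbers in the worked example; the function name `gini` is hypothetical.

```python
def gini(counts):
    """Gini coefficient of non-negative citation counts.

    Assumed formula (inferred from the worked example, not quoted from the paper):
        G = 2 * sum(i * y_i) / (n * sum(y_i)) - (n + 1) / n
    where y_i are the counts in ascending order and i runs from 1 to n.
    """
    y = sorted(counts)
    n = len(y)
    total = sum(y)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    weighted = sum(rank * value for rank, value in enumerate(y, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Counts from the worked example: four small domains and one dominant one.
print(round(gini([1, 2, 3, 4, 10]), 3))  # 0.4
```

Under this formula, perfectly equal counts give 0, and concentration of citations in a single domain pushes the value toward 1, consistent with how the Figure 3 caption describes values close to 1.0.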

work page 2021