Don't Measure Once: Measuring Visibility in AI Search (GEO)
Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3
The pith
Visibility in AI search must be assessed through repeated measurements because single queries produce unreliable snapshots due to output variability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The inherent probabilistic nature of large language model-based chat systems changes the paradigm of visibility measurement. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, the paper argues that visibility in generative engine optimization must be characterized as a distribution rather than a single-point outcome if a brand's performance is to be assessed accurately.
What carries the argument
Modeling visibility as a distribution across repeated queries to capture the probabilistic variation in AI search outputs.
If this is right
- GEO assessments must incorporate multiple samples to produce reliable rankings and visibility scores.
- Metrics for AI search should include statistical measures such as variance or probability ranges instead of fixed positions.
- Optimization efforts need to account for inconsistency in model responses rather than targeting a single expected output.
- Evaluation protocols in information retrieval for generative systems require new repeated-measurement standards.
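As an illustration of what a repeated-measurement visibility score could look like, here is a minimal Python sketch. It assumes each run of a prompt has already been reduced to a boolean "brand was mentioned" outcome; the function name `visibility_summary`, the input format, and the choice of a Wilson score interval are assumptions made for this example and are not taken from the paper.

```python
import math


def visibility_summary(mentions, z=1.96):
    """Summarize repeated runs of one prompt.

    mentions: list of booleans, True when the brand appeared in that run
              (hypothetical input; how a mention is detected is left open).
    z:        normal quantile for the interval (1.96 is roughly 95%).
    Returns the observed visibility rate and a Wilson score interval.
    """
    n = len(mentions)
    if n == 0:
        raise ValueError("need at least one run")
    p = sum(mentions) / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return {
        "runs": n,
        "rate": p,
        "low": max(0.0, center - half),
        "high": min(1.0, center + half),
    }


# Made-up outcomes for 10 runs of the same prompt (illustrative only).
runs = [True, False, True, True, False, True, False, True, True, False]
summary = visibility_summary(runs)
print(summary)

# A single positive run leaves the plausible visibility rate almost unconstrained.
single = visibility_summary([True])
print(single)
```

Under these assumptions, a single run in which the brand appears yields an interval from roughly 0.21 to 1.0, which is one way to express why a one-off observation says little about the underlying visibility rate.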
Where Pith is reading between the lines
- Standard SEO reporting tools may need to evolve into dashboards that display visibility distributions with confidence intervals.
- The approach could extend to other stochastic retrieval systems where output consistency is low.
- Businesses optimizing for AI search might adopt ongoing monitoring schedules instead of periodic single checks.
Load-bearing premise
The variability seen in AI search outputs is large enough and consistent enough that single-point measurements become practically unusable for assessing GEO performance.
What would settle it
An experiment that repeatedly queries the same set of prompts and finds negligible variation in visibility outcomes across runs would show that single measurements are sufficient.
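A rough sketch of how such an experiment could be scored is shown below. It assumes a hypothetical `results` mapping from each prompt to a list of per-run boolean visibility outcomes, and it uses one possible statistic: the chance that a single randomly chosen run disagrees with the majority outcome for that prompt. Neither the statistic nor any cutoff for "negligible" comes from the paper.

```python
from statistics import mean


def single_run_error(results):
    """Estimate how often one run would misreport the majority outcome.

    results: dict mapping prompt -> list of booleans (brand visible per run).
             This structure is a hypothetical input format for the sketch.
    Returns (per-prompt disagreement rates, average rate across prompts).
    """
    per_prompt = {}
    for prompt, runs in results.items():
        if not runs:
            raise ValueError(f"no runs recorded for prompt: {prompt!r}")
        p = sum(runs) / len(runs)
        # A single run contradicts the majority with probability min(p, 1 - p).
        per_prompt[prompt] = min(p, 1 - p)
    return per_prompt, mean(per_prompt.values())


# Made-up data: one perfectly stable prompt and one unstable prompt.
results = {
    "stable prompt": [True, True, True, True, True],
    "unstable prompt": [True, False, True, False, False],
}
per_prompt, overall = single_run_error(results)
print(per_prompt)
print(overall)
```

Values near zero across a broad prompt set would support single measurements; values well above zero would support the paper's premise. Where to draw that line is a judgment the sketch deliberately leaves open.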
Original abstract
As large language model-based chat systems become increasingly widely used, generative engine optimization (GEO) has emerged as an important problem for information access and retrieval. In classical search engines, results are comparatively transparent and stable: a single query often provides a representative snapshot of where a page or brand appears relative to competitors. The inherent probabilistic nature of AI search changes this paradigm. Answers can vary across runs, prompts, and time, making one-off observations unreliable. Drawing on empirical studies, our findings underscore the need for repeated measurements to assess a brand's GEO performance and to characterize visibility as a distribution rather than a single-point outcome.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that unlike classical search engines, where a single query provides a stable snapshot of visibility, AI search (generative engine optimization or GEO) is inherently probabilistic. LLM outputs vary across runs, prompts, and time, rendering one-off measurements unreliable; visibility must instead be assessed as a distribution via repeated measurements. The argument draws on unspecified empirical studies.
Significance. If substantiated with quantitative evidence that variability is large and consistent enough to make single measurements unusable in practice, the result would meaningfully shift evaluation practices in information retrieval and digital marketing, requiring statistical rather than deterministic approaches to GEO assessment.
Major comments (1)
- [Abstract] The central claim that 'one-off observations [are] unreliable' rests entirely on a reference to 'empirical studies' with no accompanying methods, sample sizes, quantitative results (e.g., run-to-run standard deviation of citation frequency, position, or sentiment scores), comparison to classical search variance, or decision threshold for when single measurements fail. This is load-bearing because the paper's thesis is that the observed variability is practically large enough to necessitate repeated measurements; without these data the assertion remains untested.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive critique. The feedback correctly identifies a gap in substantiation that we will address through revision.
Point-by-point responses
Referee: [Abstract] The central claim that 'one-off observations [are] unreliable' rests entirely on a reference to 'empirical studies' with no accompanying methods, sample sizes, quantitative results (e.g., run-to-run standard deviation of citation frequency, position, or sentiment scores), comparison to classical search variance, or decision threshold for when single measurements fail. This is load-bearing because the paper's thesis is that the observed variability is practically large enough to necessitate repeated measurements; without these data the assertion remains untested.
Authors: We agree that the abstract, as written, does not supply the quantitative details needed to make the central claim self-contained and testable. The manuscript references empirical observations of output variability but does not report methods, sample sizes, or specific statistics in the abstract, nor does it report them with sufficient explicitness elsewhere. In the revised manuscript we will (1) expand the abstract to include concise quantitative summaries (e.g., number of queries, runs per query, observed standard deviations in citation frequency and position), (2) add a dedicated methods/results subsection that reports the experimental protocol, variance measurements, a direct comparison to the stability of classical search rankings, and a discussion of practical thresholds at which single measurements become unreliable, and (3) ensure all claims are tied to these data rather than left as an unspecified reference. These changes will be made in the next version.
Revision: yes
Circularity Check
No circularity: argument rests on cited empirical observations without self-referential reduction
The paper advances a conceptual claim that probabilistic variation in LLM outputs renders single GEO visibility measurements unreliable, supported by reference to external empirical studies. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central assertion does not reduce to a definition, fit, or prior author result by construction; it functions as an interpretive recommendation grounded in observed variability rather than a closed logical loop.
Reference graph
Works this paper leans on
- [1] URL https://papers.ssrn.com/abstract=5393256. F. Rejón-Guardia, S. Molinillo, and R. Anaya-Sánchez. Generative engine optimization: How search engines integrate AI-generated content into conventional queries. In Encyclopedia of Artificial Intelligence in Marketing, pages 1–8. Springer, 2025. E. Silliman, J. Boudet, and K. Robinson. Winning in the age of AI ...
- [2] Assign ranks $i = [1, 2, 3, 4, 5]$.
- [3] Compute weighted sums: $\sum_i y_i = 1 + 2 + 3 + 4 + 10 = 20$ and $\sum_i i \cdot y_i = 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 + 4 \cdot 4 + 5 \cdot 10 = 80$.
- [4] Apply Equation (5): $G = \frac{2 \times 80}{5 \times 20} - \frac{6}{5} = 1.6 - 1.2 = 0.4$. $G = 0.4$ indicates moderate inequality: the dominant domain (10 citations) accounts for 50% of all citations while the four others share the remaining 50% unevenly.
  J Convergence Analysis: How Many Runs Are Sufficient?
  J.1 Motivation and Method
  The stochastic nature of LLM outputs means that a single qu...
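The Gini arithmetic in the excerpt above can be checked with a short Python sketch. Equation (5) itself is not shown in the excerpt, so the general form used here, $G = \frac{2 \sum_i i\, y_i}{n \sum_i y_i} - \frac{n+1}{n}$ with citation counts sorted in ascending order, is an assumption inferred from the worked numbers (it matches the standard rank-based Gini formula). The function name `gini` is chosen for this example.

```python
def gini(citations):
    """Rank-based Gini coefficient of citation counts.

    Assumed form (inferred from the worked example, not quoted from the paper):
        G = 2 * sum(i * y_i) / (n * sum(y_i)) - (n + 1) / n
    where y_i are the counts sorted ascending and i = 1..n are their ranks.
    """
    ys = sorted(citations)
    n = len(ys)
    total = sum(ys)
    if n == 0 or total == 0:
        raise ValueError("need at least one domain with a nonzero citation count")
    weighted = sum(rank * y for rank, y in enumerate(ys, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Citation counts from the worked example: four small domains and one dominant one.
example = gini([1, 2, 3, 4, 10])
print(round(example, 3))  # 0.4

# Perfectly equal counts give zero inequality.
equal = gini([5, 5, 5, 5])
print(round(equal, 3))  # 0.0
```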