From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms
Pith reviewed 2026-05-07 15:16 UTC · model grok-4.3
The pith
Citation selection and absorption diverge across generative search engines, with Perplexity and Google citing more sources while ChatGPT shows higher average influence from the sources it selects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that citation selection and citation absorption are distinct, measurable stages in generative engines. Analysis of over 21,000 citations shows that Perplexity and Google select more sources on average, while ChatGPT selects fewer sources but shows higher average influence per fetched page in its answers. High-influence pages exhibit greater length, more structure, stronger semantic alignment, and richer extractable evidence such as definitions, numerical facts, comparisons, and procedural steps.
What carries the argument
The two-stage measurement framework that separates citation selection (search triggering and source choice) from citation absorption (contribution of language, evidence, structure, or facts to the generated answer).
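In code, the two stages reduce to separate per-platform statistics: selection is about how many sources get cited, absorption about how much each fetched page contributes. A minimal sketch (field names and records are hypothetical, not drawn from the geo-citation-lab schema):

```python
from statistics import mean

def selection_breadth(runs):
    """Stage 1 (selection): average number of sources cited per answer."""
    return mean(len(run["cited_urls"]) for run in runs)

def absorption_depth(runs):
    """Stage 2 (absorption): average influence among fetched citations."""
    scores = [c["influence"] for run in runs
              for c in run["citations"] if c["fetched"]]
    return mean(scores) if scores else 0.0

# Hypothetical runs illustrating the breadth/depth divergence:
perplexity = [{"cited_urls": ["a", "b", "c", "d"],
               "citations": [{"fetched": True, "influence": 0.2},
                             {"fetched": True, "influence": 0.3}]}]
chatgpt = [{"cited_urls": ["a", "b"],
            "citations": [{"fetched": True, "influence": 0.7},
                          {"fetched": True, "influence": 0.6}]}]

assert selection_breadth(perplexity) > selection_breadth(chatgpt)  # breadth
assert absorption_depth(chatgpt) > absorption_depth(perplexity)    # depth
```

A platform can rank first on one metric and last on the other, which is exactly why the paper treats them as separate outcomes.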
If this is right
- Optimization for generative engines requires separate tactics for increasing selection probability and increasing absorption influence.
- Content with explicit definitions, comparisons, numerical facts, and procedural steps is more likely to be absorbed once cited.
- Platforms show consistent differences in citation breadth versus depth, so uniform GEO strategies will not work across them.
- Measurement of GEO success must track answer-level absorption rather than stopping at citation counts.
Where Pith is reading between the lines
- Content creators may need to redesign pages to emphasize extractable evidence blocks rather than relying on traditional SEO signals alone.
- The divergence could mean that high-volume citation platforms reward discoverability while low-volume ones reward depth, creating different market niches for publishers.
- If absorption features prove stable, they could be used to predict which pages will shape answers even before a query is issued.
Load-bearing premise
Features extracted from fetched pages such as length, structure, and evidence richness accurately proxy the degree to which the page's content was absorbed into the generated answer.
What would settle it
A side-by-side comparison of the framework's influence scores against human raters' judgments of how much each cited page actually shaped the final answer text.
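Such a comparison could be scored with a rank correlation between the framework's influence scores and mean human ratings per cited page. A stdlib-only sketch using the no-ties Spearman formula (all data here is hypothetical):

```python
def ranks(xs):
    """1-based ranks (assumes no ties, enough for this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical pairs: framework influence score vs. mean human 1-5
# rating of how much the same cited page shaped the answer.
influence_scores = [0.91, 0.40, 0.75, 0.12, 0.58, 0.33, 0.85, 0.22]
human_ratings = [4.7, 2.1, 3.9, 1.2, 3.0, 2.4, 4.2, 1.5]

print(round(spearman(influence_scores, human_ratings), 3))  # prints 0.976
```

A high rho would support the proxy; a low one would indicate the 72 features track something other than absorption.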
Original abstract
Generative search engines increasingly determine whether online information is merely discoverable, cited as a source, or actually absorbed into generated answers. This paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO): citation selection, where a platform triggers search and chooses sources, and citation absorption, where a cited page contributes language, evidence, structure, or factual support to the final answer. We analyze the public geo-citation-lab dataset covering 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity; 21,143 valid search-layer citations; 23,745 citation-level feature records; 18,151 successfully fetched pages; and 72 extracted features. The central descriptive finding is that citation breadth and citation depth diverge. Perplexity and Google cite more sources on average, while ChatGPT cites fewer sources but shows substantially higher average citation influence among fetched pages. High-influence pages tend to be longer, more structured, semantically aligned, and richer in extractable evidence such as definitions, numerical facts, comparisons, and procedural steps. The results suggest that GEO should be measured beyond citation counts, with answer-level absorption treated as a separate outcome.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO) that separates citation selection (platforms triggering search and choosing sources) from citation absorption (the degree to which a cited page contributes language, evidence, structure, or facts to the generated answer). Using the public geo-citation-lab dataset of 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity, the analysis covers 21,143 search-layer citations, 23,745 citation-level feature records, 18,151 fetched pages, and 72 extracted features. The central descriptive finding is a divergence between citation breadth and depth: Perplexity and Google cite more sources on average, while ChatGPT cites fewer sources but exhibits substantially higher average citation influence among fetched pages. High-influence pages are longer, more structured, semantically aligned, and richer in extractable evidence such as definitions, facts, comparisons, and procedural steps. The results argue that GEO measurement must extend beyond citation counts to treat absorption as a distinct outcome.
Significance. If the absorption proxy holds, the work is significant for shifting GEO research from selection-only metrics to a fuller two-stage view, supported by a large-scale, controlled, public dataset that enables reproducibility. The empirical contrasts across three major platforms provide concrete data on how generative engines differ in source usage, and the feature-based characterization of high-influence pages offers actionable insights for content optimization. Strengths include the dataset scale, the public release of geo-citation-lab, and the clear separation of selection versus absorption stages, which could ground future platform comparisons and optimization studies.
major comments (3)
- §3.2 (Citation Absorption Measurement) and §4.2 (Feature Extraction): The citation influence score is computed from the 72 page features (length, structure, semantic alignment, evidence richness) without any reported direct validation against content-contribution metrics such as token overlap, entity alignment, sentence-level similarity, or human judgments of absorption. This assumption is load-bearing for the central breadth-depth divergence claim, as the higher influence reported for ChatGPT could reflect page quality that predicts selection rather than post-selection absorption.
- Results section (descriptive statistics and platform comparisons): The reported differences in average citation counts and influence scores across platforms are presented without statistical tests, error bars, or confidence intervals, despite the large sample (over 21k citations). This weakens the claim of 'substantially higher' influence for ChatGPT and the overall divergence finding.
- §5 (Discussion and Implications): The recommendation that GEO should be measured beyond citation counts treats absorption as a separate outcome, but no ablation study, sensitivity analysis, or robustness check on the 72-feature proxy is reported; post-hoc feature selection could therefore drive the characterization of high-influence pages.
minor comments (3)
- Abstract: Dataset breakdowns by platform (e.g., citations per engine) are not provided, which would help readers interpret the platform-specific contrasts.
- §2 (Related Work): Limited discussion of how the proposed framework differs from prior citation analysis in traditional web search or from existing GEO studies; a few additional references would clarify novelty.
- Figure captions and Table 1: The exact weighting or aggregation formula used to combine the 72 features into the influence score is not shown; adding this would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below with point-by-point responses. Revisions have been made to strengthen the presentation of the proxy measure, add statistical support, and include robustness checks while preserving the descriptive focus of the study.
Point-by-point responses
Referee: §3.2 (Citation Absorption Measurement) and §4.2 (Feature Extraction): The citation influence score is computed from the 72 page features (length, structure, semantic alignment, evidence richness) without any reported direct validation against content-contribution metrics such as token overlap, entity alignment, sentence-level similarity, or human judgments of absorption. This assumption is load-bearing for the central breadth-depth divergence claim, as the higher influence reported for ChatGPT could reflect page quality that predicts selection rather than post-selection absorption.
Authors: We agree that the influence score is a proxy and that direct validation against token overlap, entity alignment, or human judgments is absent. The 72 features were chosen a priori to operationalize absorption potential based on content attributes generative models are known to utilize. The dataset does not currently include token-level alignments between fetched pages and generated answers, precluding such validation without new data collection. In the revision we have added an explicit limitations paragraph in §3.2 acknowledging that high-influence pages may also be more selectable, and we report a qualitative review of 50 randomly sampled high-influence citations to illustrate absorption patterns. This is noted as an area for future extension rather than a completed validation. revision: partial
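A direct validation of the kind the referee asks for could start from token overlap between each fetched page and the generated answer. The sketch below is a crude recall-style proxy, not the paper's method; it ignores paraphrase and entity alignment, and the example strings are invented:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(page_text, answer_text):
    """Fraction of answer tokens that also occur in the cited page:
    a crude recall-style absorption proxy (ignores paraphrase)."""
    page = Counter(tokenize(page_text))
    answer = tokenize(answer_text)
    if not answer:
        return 0.0
    hits = sum(1 for tok in answer if page[tok] > 0)
    return hits / len(answer)

page = "GEO separates citation selection from citation absorption."
answer = "The framework separates selection from absorption of citations."
print(overlap_score(page, answer))  # prints 0.5
```

Correlating such overlap scores with the 72-feature influence score across the 18,151 fetched pages would test whether the proxy tracks actual content contribution.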
Referee: Results section (descriptive statistics and platform comparisons): The reported differences in average citation counts and influence scores across platforms are presented without statistical tests, error bars, or confidence intervals, despite the large sample (over 21k citations). This weakens the claim of 'substantially higher' influence for ChatGPT and the overall divergence finding.
Authors: The results section is intentionally descriptive to characterize platform behaviors at scale. We accept that adding inferential statistics would improve rigor. The revised manuscript now includes Welch’s t-tests (with Bonferroni correction) for mean differences in citation counts and influence scores, 95% confidence intervals, and Cohen’s d effect sizes. These confirm statistical significance (p < 0.001) for the reported platform divergences, including the higher average influence for ChatGPT. revision: yes
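The statistics the authors describe can be sketched with stdlib arithmetic: Welch's t statistic and Cohen's d are shown below (exact p-values additionally need the t distribution with Welch-Satterthwaite degrees of freedom, e.g. scipy.stats.ttest_ind with equal_var=False, and a Bonferroni correction just divides the significance threshold by the number of platform pairs). The influence scores here are hypothetical:

```python
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic: no equal-variance assumption."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Hypothetical per-citation influence scores for two platforms:
chatgpt = [0.72, 0.65, 0.80, 0.70, 0.68, 0.75]
perplexity = [0.35, 0.40, 0.30, 0.45, 0.38, 0.33]

print(f"t = {welch_t(chatgpt, perplexity):.2f}, "
      f"d = {cohens_d(chatgpt, perplexity):.2f}")
```

With 21k+ citations, even tiny mean differences reach significance, which is why reporting effect sizes alongside p-values matters here.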
Referee: §5 (Discussion and Implications): The recommendation that GEO should be measured beyond citation counts treats absorption as a separate outcome, but no ablation study, sensitivity analysis, or robustness check on the 72-feature proxy is reported; post-hoc feature selection could therefore drive the characterization of high-influence pages.
Authors: The 72 features were assembled from prior SEO and content-quality literature before any analysis, not selected post-hoc. To address robustness concerns we have added a sensitivity analysis in the revised §5 that recomputes influence scores under alternative aggregation schemes (equal weighting, category-only subsets, and exclusion of length). The platform divergence and the profile of high-influence pages (longer, structured, evidence-rich) remain stable. We also report a feature-category ablation showing that evidence-richness and structure contribute most to the score. revision: yes
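The sensitivity analysis described here amounts to recomputing a weighted aggregate under alternative weight vectors and checking that page rankings are stable. A sketch with four stand-in feature categories (the paper's 72 features and its actual aggregation weights are not published, so every name and number below is illustrative):

```python
def influence(features, weights=None):
    """Weighted average of normalized feature values; equal weights
    when no scheme is given."""
    if weights is None:
        weights = {name: 1.0 for name in features}
    total = sum(weights[name] for name in features)
    return sum(weights[name] * value
               for name, value in features.items()) / total

# Hypothetical normalized feature values for two fetched pages:
page_a = {"length": 0.9, "structure": 0.8, "alignment": 0.7, "evidence": 0.9}
page_b = {"length": 0.4, "structure": 0.3, "alignment": 0.5, "evidence": 0.2}

schemes = {
    "equal": None,
    "no_length": {"length": 0.0, "structure": 1.0,
                  "alignment": 1.0, "evidence": 1.0},
    "evidence_heavy": {"length": 0.5, "structure": 1.0,
                       "alignment": 1.0, "evidence": 2.0},
}

# Robustness check: the page ranking should survive every reweighting.
for name, w in schemes.items():
    assert influence(page_a, w) > influence(page_b, w), name
```

If the breadth-depth divergence flipped under any of these schemes, the finding would be an artifact of the weighting rather than a property of the data.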
Circularity Check
Empirical measurement framework with no circular derivation
full rationale
The paper presents a descriptive two-stage measurement framework applied to an external public dataset (geo-citation-lab) of 602 prompts, 21k citations, and 18k fetched pages. Central findings consist of direct empirical counts (average sources cited) and feature-based statistics (72 page attributes such as length and evidence richness correlated with influence scores). No equations, fitted parameters, or derivations are described that reduce by construction to the inputs or to self-citations; the analysis remains self-contained against the collected data without self-definitional loops or load-bearing prior-author results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Fetched pages and their extracted features are sufficient proxies for the content actually absorbed by the generative model.
invented entities (1)
- citation influence score (no independent evidence)
Reference graph
Works this paper leans on
- [1] A two-stage formalization of GEO that separates citation selection from citation absorption
- [2] A cross-platform empirical summary of ChatGPT, Google AI Overview/Gemini, and Perplexity using the public geo-citation-lab dataset
- [3] A measurement interpretation of influence_score as an answer-level absorption proxy, including its mathematical components and the modeling restrictions that follow from its construction
- [4] A set of counter-intuitive empirical findings that challenge shallow GEO heuristics such as maximizing citation count or converting all content into Q&A pages
- [5] A scientific self-audit and reproducibility checklist designed to support independent review and replication
- [6]
- [7] Chen, M., Wang, X., Chen, K., and Koudas, N. (2025). Generative Engine Optimization: How to Dominate AI Search. arXiv:2509.08919
- [8] Tian, Z., Chen, Y., Tang, Y., Liu, J., and Jia, R. (2026). Diagnosing and Repairing Citation Failures in Generative Engine Optimization. arXiv:2603.09296
- [9] Liu, Z., and Xu, P. (2026). Think Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility. arXiv:2604.19113
- [10] Yuan, J., Wang, J., Wang, Z., Sun, Q., Wang, R., and Li, J. (2026). AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization. arXiv:2603.20213
- [11] Yu, J., Yang, M., Ding, Y., and Sato, H. (2026). Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior. arXiv:2603.29979
- [12] Narayanan Venkit, P., Laban, P., Zhou, Y., Mao, Y., and Wu, C.-S. (2024). Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses. arXiv:2410.22349
- [13] Yang, K.-C. (2025). News Source Citing Patterns in AI Search Systems. arXiv:2507.05301
- [14] Xu, Y., Qi, P., Chen, J., Liu, K., Han, R., Liu, L., Min, B., Castelli, V., Gupta, A., and Wang, Z. (2025). CiteEval: Principle-Driven Citation Evaluation for Source Attribution. arXiv:2506.01829
- [15] Qian, H., Fan, Y., Zhang, R., and Guo, J. (2024). On the Capacity of Citation Generation by Large Language Models. arXiv:2410.11217
- [16] Kirsten, E., Grosse Perdekamp, J., Upadhyay, M., Gummadi, K. P., and Zafar, M. B. (2025). Characterizing Web Search in The Age of Generative AI. arXiv:2510.11560
- [17] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems
- [18] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP
- [19] Nakano, R., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332
- [20] Menick, J., et al. (2022). Teaching Language Models to Support Answers with Verified Quotes. arXiv:2203.11147
- [21] geo-citation-lab repository. (2026). A dataset and analysis pipeline for studying how AI search engines select and use citations. GitHub: https://github.com/yaojingang/geo-citation-lab. Accessed April 28, 2026
- [22] geo-citation-lab final report. (2026). Overseas GEO Research Long Report, recalculated version. https://yaojingang.github.io/geo-citation-lab/04-repet/final_report.html. Accessed April 28, 2026
- [23] Yao Jingang. (2026). GitHub profile. https://github.com/yaojingang. Accessed April 29, 2026
- [24] Yao Jingang. (2026). X profile. https://x.com/yaojingang. Accessed April 29, 2026