Measuring the Importance of User-Generated Content to Search Engines
Pith reviewed 2026-05-25 19:18 UTC · model grok-4.3
The pith
Wikipedia appears in over 80% of Google results pages for some query types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analyzing results for six types of important queries, the authors observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types, thereby quantifying the extent to which search engines leverage user-generated content.
What carries the argument
Audit and classification of content sources appearing in Google search engine results pages for six selected query types.
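As a rough illustration of what such a tally involves, the sketch below computes, for each query category, the share of results pages on which each source appears at least once. It is not the authors' code: the function name `incidence_by_category`, the input layout (one `(category, sources)` pair per results page), and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def incidence_by_category(serps):
    """Return {category: {source: share of pages containing that source}}.

    `serps` is an iterable of (category, sources) pairs, one pair per
    results page. This layout is a hypothetical convenience, not the
    format used in the paper.
    """
    pages = defaultdict(int)                      # pages seen per category
    hits = defaultdict(lambda: defaultdict(int))  # pages containing each source
    for category, sources in serps:
        pages[category] += 1
        for source in set(sources):               # count each source once per page
            hits[category][source] += 1
    return {
        category: {s: n / pages[category] for s, n in hits[category].items()}
        for category in pages
    }


# Made-up data for illustration only; these are not results from the paper.
sample = [
    ("trending", ["wikipedia.org", "twitter.com", "wikipedia.org"]),
    ("trending", ["wikipedia.org"]),
    ("loans", ["example.com"]),
    ("loans", ["wikipedia.org", "example.com"]),
]
rates = incidence_by_category(sample)
print(rates["trending"])
print(rates["loans"])
```

Counting each source once per page matches the "appears in X% of results pages" framing of the headline claim; counting every link instead would measure a different quantity.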
If this is right
- Wikipedia is the dominant source in search results.
- User-generated content is essential for responding to a wide range of queries.
- The measurements inform discussions on economic benefits for content creators.
- Search engines derive substantial value from public user contributions.
Where Pith is reading between the lines
- The dependence may extend to other search engines or AI systems using public data.
- This could motivate calls for revenue-sharing mechanisms with content platforms like Wikipedia.
- Using actual user search logs rather than predefined query types might provide a more complete picture.
Load-bearing premise
The six selected query types and the scraping and classification methods yield an unbiased sample of user information needs and Google's source selection practices.
What would settle it
A replication using a broader or differently selected set of queries that shows substantially lower rates of Wikipedia inclusion would indicate the original sample overstated the importance.
Original abstract
Search engines are some of the most popular and profitable intelligent technologies in existence. Recent research, however, has suggested that search engines may be surprisingly dependent on user-created content like Wikipedia articles to address user information needs. In this paper, we perform a rigorous audit of the extent to which Google leverages Wikipedia and other user-generated content to respond to queries. Analyzing results for six types of important queries (e.g. most popular, trending, expensive advertising), we observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types. More generally, our results provide empirical information to inform a nascent but rapidly-growing debate surrounding a highly-consequential question: Do users provide enough value to intelligent technologies that they should receive more of the economic benefits from intelligent technologies?
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to conduct a rigorous audit of Google's use of user-generated content, particularly Wikipedia, by analyzing search results for six types of queries (most popular, trending, expensive advertising, etc.). It reports that Wikipedia appears in over 80% of results pages for some query types and is the most prevalent source overall, providing empirical data to inform debates on the economic value of user contributions to intelligent technologies.
Significance. If the empirical measurements hold, this work supplies direct observational evidence of search engine dependence on user-generated content, which is valuable for the growing discussion on compensating content creators. The approach is purely empirical with no fitted parameters or derivations, making the tallies falsifiable in principle if the protocol is reproducible.
major comments (2)
- [Abstract] The claim of performing a 'rigorous audit' is undermined by the absence of any sampling frame for the queries, scraping protocol details (e.g., controls for device, location, time, or personalization), inter-rater reliability metrics, or error bars on the prevalence figures; without these, the 80% figure cannot be assessed for selection or measurement bias, which is load-bearing for the central prevalence claim.
- [Abstract] The six query types are presented without justification or validation against actual user query distributions (such as from search logs), raising the possibility that the categories were selected in a way that overrepresents knowledge-base friendly queries, directly affecting the generalizability of the 'most prevalent' finding.
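To make the error-bar point concrete, the sketch below computes a Wilson score interval for a prevalence figure. It only applies if the queries in a category are treated as a sample from a larger population of queries, which is an assumption of this example rather than something the paper states. The function name `wilson_interval` and the counts used are hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits/n.

    z=1.96 corresponds to roughly 95% coverage. `hits` is the number of
    results pages containing the source, `n` the number of pages examined.
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: a source appears on 17 of 20 results pages (85%).
low, high = wilson_interval(17, 20)
print(f"85% observed, interval roughly {low:.2f} to {high:.2f}")
```

With category sizes in the tens, an interval of this kind is wide, which is why the distinction between an exhaustive tally and a sample matters for how the 80% figure is read.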
minor comments (1)
- The abstract mentions 'e.g. most popular, trending, expensive advertising' but does not list all six types explicitly; a complete enumeration would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed comments. We address each major point below and agree that the abstract would benefit from added methodological context.
- Referee: [Abstract] The claim of performing a 'rigorous audit' is undermined by the absence of any sampling frame for the queries, scraping protocol details (e.g., controls for device, location, time, or personalization), inter-rater reliability metrics, or error bars on the prevalence figures; without these, the 80% figure cannot be assessed for selection or measurement bias, which is load-bearing for the central prevalence claim.
  Authors: The abstract is brief by design, but the full manuscript specifies the query sources and a fixed scraping setup to limit personalization. We agree the abstract should note these elements and the exhaustive (rather than sampled) nature of the tallies, which explains the lack of error bars. We will revise the abstract accordingly.
  Revision: yes
- Referee: [Abstract] The six query types are presented without justification or validation against actual user query distributions (such as from search logs), raising the possibility that the categories were selected in a way that overrepresents knowledge-base friendly queries, directly affecting the generalizability of the 'most prevalent' finding.
  Authors: The types were selected to cover queries of high economic and social significance using public indicators; the manuscript does not claim they represent the full distribution of user queries. We will add a concise justification to the abstract while retaining the scope of the claims.
  Revision: yes
Circularity Check
No significant circularity; pure empirical measurement
The paper conducts an observational audit by selecting six query types, scraping Google results pages, and tallying the prevalence of Wikipedia and other user-generated content sources. No equations, derivations, fitted parameters, or predictions are present. Central claims consist of direct counts (e.g., Wikipedia in >80% of results for some types) from the collected data. No self-citations serve as load-bearing justifications for uniqueness or ansatzes, and the methodology does not reduce any result to its inputs by construction. Sampling representativeness is a methodological validity concern rather than a circularity issue.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The six query types (most popular, trending, expensive advertising, etc.) constitute a representative sample of important user information needs.
- domain assumption: Search-result pages can be reliably parsed to identify and count distinct content sources.
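The second assumption hides a normalization step: links must be mapped to a "source" before they can be counted. The sketch below shows one deliberately naive way to do that, keeping the last two labels of the hostname. This rule is an assumption of the example, not the paper's method, and it is wrong for multi-part suffixes such as `.co.uk`; a real audit would need a public-suffix list plus handling for SERP elements that are not plain links. The names `source_of` and `distinct_sources` are hypothetical.

```python
from urllib.parse import urlparse


def source_of(url):
    """Map a URL to a coarse source label, or None if no hostname is found.

    Naive rule (an assumption for illustration): keep the last two
    hostname labels, so en.wikipedia.org and en.m.wikipedia.org both
    become wikipedia.org.
    """
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def distinct_sources(urls):
    """Return the set of source labels found among the given URLs."""
    return {s for s in (source_of(u) for u in urls) if s is not None}


links = [
    "https://en.wikipedia.org/wiki/Insurance",
    "https://en.m.wikipedia.org/wiki/Loan",
    "https://twitter.com/some_account",
]
print(distinct_sources(links))
```

How variants of the same site are merged directly changes the tallies, so the normalization rule is part of what a replication would need to hold fixed.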