Measuring the Importance of User-Generated Content to Search Engines

Brent Hecht; Isaac Johnson; Nicholas Vincent; Patrick Sheehan

arxiv: 1906.08576 · v1 · pith:6CEJ5T62new · submitted 2019-06-20 · 💻 cs.CY

Pith reviewed 2026-05-25 19:18 UTC · model grok-4.3

keywords: search engines, user-generated content, Wikipedia, Google, query analysis, content sources, information retrieval, economic benefits

The pith

Wikipedia appears in over 80% of Google results pages for some query types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a rigorous audit of Google search results for six types of queries to determine the role of user-generated content. Wikipedia is found to appear in over 80% of results pages for some categories and stands as the most prevalent source across all types examined. This provides empirical evidence of search engines' dependence on user-created materials to fulfill information needs. The findings are positioned to contribute to debates about whether users should receive greater economic returns from the value their content adds to intelligent systems.

Core claim

Analyzing results for six types of important queries, the authors observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types, thereby quantifying the extent to which search engines leverage user-generated content.

What carries the argument

Audit and classification of content sources appearing in Google search engine results pages for six selected query types.

If this is right

Wikipedia is the dominant source in search results.
User-generated content is essential for responding to a wide range of queries.
The measurements inform discussions on economic benefits for content creators.
Search engines derive substantial value from public user contributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The dependence may extend to other search engines or AI systems using public data.
This could motivate calls for revenue-sharing mechanisms with content platforms like Wikipedia.
Using actual user search logs rather than predefined query types might provide a more complete picture.

Load-bearing premise

The six selected query types and the scraping and classification methods yield an unbiased sample of user information needs and Google's source selection practices.

What would settle it

A replication using a broader or differently selected set of queries that shows substantially lower rates of Wikipedia inclusion would indicate the original sample overstated the importance.

Figures

Figures reproduced from arXiv: 1906.08576 by Brent Hecht, Isaac Johnson, Nicholas Vincent, Patrick Sheehan.

**Figure 2.** This figure summarizes key metrics for all UGC domains in our study and the top 5 non...

Original abstract

Search engines are some of the most popular and profitable intelligent technologies in existence. Recent research, however, has suggested that search engines may be surprisingly dependent on user-created content like Wikipedia articles to address user information needs. In this paper, we perform a rigorous audit of the extent to which Google leverages Wikipedia and other user-generated content to respond to queries. Analyzing results for six types of important queries (e.g. most popular, trending, expensive advertising), we observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types. More generally, our results provide empirical information to inform a nascent but rapidly-growing debate surrounding a highly-consequential question: Do users provide enough value to intelligent technologies that they should receive more of the economic benefits from intelligent technologies?

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies new counts of Wikipedia's heavy presence in Google results for six query types, but the methodological details are missing, so the 80% figure is hard to assess.

The key takeaway is that the authors audited Google results across six query categories and found Wikipedia in over 80% of pages for some types, far ahead of any other single source. That gives fresh numbers on search engine reliance on user-generated content, which can feed into economic and regulatory arguments about value capture from platforms. They cover a range of query types, including popular, trending, and expensive-advertising queries, and they tally sources systematically rather than just noting the dependence in passing. That scale of audited counts is new compared to earlier work. The approach is straightforward empirical measurement with no fitted models or derivations, so the results stand or fall on the data collection itself.

The main gap is in the audit procedure. The abstract mentions a rigorous audit but gives no sampling frame for the queries, no scraping protocol, no inter-rater checks, and no error bars. Without those, it is difficult to rule out selection bias in the query categories or measurement issues in how sources were classified. The stress-test concern about whether the six types and the pipeline represent real user needs or Google's typical behavior lands directly on the abstract, which does not address it.

This paper is aimed at researchers working on platform economics, search behavior, or the value of user contributions. A reader who wants concrete prevalence statistics across query classes will get something useful to cite or build on, but anyone evaluating the strength of the dependence claim will need the full methods to decide.

The work is coherent on its own terms as an empirical tally and shows honest engagement with the literature on UGC dependence. It deserves peer review so referees can request the missing protocol details and check reproducibility.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to conduct a rigorous audit of Google's use of user-generated content, particularly Wikipedia, by analyzing search results for six types of queries (most popular, trending, expensive advertising, etc.). It reports that Wikipedia appears in over 80% of results pages for some query types and is the most prevalent source overall, providing empirical data to inform debates on the economic value of user contributions to intelligent technologies.

Significance. If the empirical measurements hold, this work supplies direct observational evidence of search engine dependence on user-generated content, which is valuable for the growing discussion on compensating content creators. The approach is purely empirical with no fitted parameters or derivations, making the tallies falsifiable in principle if the protocol is reproducible.

major comments (2)

[Abstract] The claim of performing a 'rigorous audit' is undermined by the absence of any sampling frame for the queries, scraping protocol details (e.g., controls for device, location, time, or personalization), inter-rater reliability metrics, or error bars on the prevalence figures; without these, the 80% figure cannot be assessed for selection or measurement bias, which is load-bearing for the central prevalence claim.
[Abstract] The six query types are presented without justification or validation against actual user query distributions (such as from search logs), raising the possibility that the categories were selected in a way that overrepresents knowledge-base friendly queries, directly affecting the generalizability of the 'most prevalent' finding.

minor comments (1)

The abstract mentions 'e.g. most popular, trending, expensive advertising' but does not list all six types explicitly; a complete enumeration would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. We address each major point below and agree that the abstract would benefit from added methodological context.

Referee: [Abstract] The claim of performing a 'rigorous audit' is undermined by the absence of any sampling frame for the queries, scraping protocol details (e.g., controls for device, location, time, or personalization), inter-rater reliability metrics, or error bars on the prevalence figures; without these, the 80% figure cannot be assessed for selection or measurement bias, which is load-bearing for the central prevalence claim.

Authors: The abstract is brief by design, but the full manuscript specifies the query sources and a fixed scraping setup to limit personalization. We agree the abstract should note these elements and the exhaustive (rather than sampled) nature of the tallies, which explains the lack of error bars. We will revise the abstract accordingly.
Revision: yes

Referee: [Abstract] The six query types are presented without justification or validation against actual user query distributions (such as from search logs), raising the possibility that the categories were selected in a way that overrepresents knowledge-base friendly queries, directly affecting the generalizability of the 'most prevalent' finding.

Authors: The types were selected to cover queries of high economic and social significance using public indicators; the manuscript does not claim they represent the full distribution of user queries. We will add a concise justification to the abstract while retaining the scope of the claims.
Revision: yes
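
The exchange above turns on whether error bars are meaningful for exhaustive tallies. If one instead treats the queries in a category as a sample from a larger population of possible queries, a binomial interval can be attached to each incidence rate. The following is a minimal sketch under that assumption and is not part of the paper's analysis. The function name `wilson_interval` and the counts (17 of 20 pages) are hypothetical, chosen only because the extracted text reports 10-20 queries per category.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n.

    z=1.96 corresponds to roughly 95% coverage. This treats each results
    page as an independent draw, which is an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: a domain appears on 17 of 20 results pages.
low, high = wilson_interval(17, 20)
print(f"incidence {17 / 20:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only 10 to 20 queries per category the interval is wide, which is one way to read the referee's concern about assessing a single headline percentage.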

Circularity Check

0 steps flagged

No significant circularity; pure empirical measurement

The paper conducts an observational audit by selecting six query types, scraping Google results pages, and tallying the prevalence of Wikipedia and other user-generated content sources. No equations, derivations, fitted parameters, or predictions are present. Central claims consist of direct counts (e.g., Wikipedia in >80% of results for some types) from the collected data. No self-citations serve as load-bearing justifications for uniqueness or ansatzes, and the methodology does not reduce any result to its inputs by construction. Sampling representativeness is a methodological validity concern rather than a circularity issue.
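
The tallying step described above can be sketched as follows. This is an illustration only, not the authors' released software: the record layout (category, query, list of result URLs), the function name `incidence_by_category`, and the toy URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def incidence_by_category(serps, domain):
    """Per query category, the fraction of results pages that contain
    at least one link whose host is `domain` or a subdomain of it."""
    pages = defaultdict(int)
    hits = defaultdict(int)
    for category, _query, urls in serps:
        pages[category] += 1
        hosts = [(urlparse(u).hostname or "").lower() for u in urls]
        if any(h == domain or h.endswith("." + domain) for h in hosts):
            hits[category] += 1
    return {c: hits[c] / pages[c] for c in pages}


# Hypothetical toy data: one record per results page.
serps = [
    ("trending", "q1", ["https://en.wikipedia.org/wiki/A", "https://example.com/a"]),
    ("trending", "q2", ["https://example.com/b"]),
    ("high-revenue", "q3", ["https://example.org/c"]),
]
rates = incidence_by_category(serps, "wikipedia.org")
print(rates)
```

Counting a page once no matter how many links point to the domain matches the "appears in X% of results pages" framing; a rank-aware metric would need the position of each link as well.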

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical audit rests on domain assumptions about query representativeness and source classification validity rather than free parameters or new entities.

axioms (2)

domain assumption: The six query types (most popular, trending, expensive advertising, etc.) constitute a representative sample of important user information needs.
Abstract states these categories were chosen as important without further justification of coverage.
domain assumption: Search-result pages can be reliably parsed to identify and count distinct content sources.
Measurement of prevalence depends on accurate source attribution in scraped results.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1]

Moreover, Google makes over $20 billion per year from search advertising revenue (Townsend

and Google.com is the most-visited website in the entire world (Alexa.com 2018). Moreover, Google makes over $20 billion per year from search advertising revenue (Townsend

work page 2018
[2]

and Google’s market capitalization is one of the highest in the world (Forbes 2018). However, very recent work has suggested that search engines, despite their power and profitability, may be surprisingly dependent on a resource that is both volunteer-created and freely available: user-generated content (UGC), and specifically Wikipedia. In particular,...

work page 2018
[3]

top three links

Please cite that version instead. improvements to search engine algorithms, for instance the introduction of deep learning (Clark 2015). While McMahon et al. showed that Google search users have a strong preference for Wikipedia pages when they are surfaced, McMahon et al.’s study design did not allow them to ask an equally important question: How often d...

work page 2015
[4]

2016; Hecht and Gergle 2009) – not only have an impact within the Wikipedia web site, but also affect popular search engines

and geographic content biases (Johnson et al. 2016; Hecht and Gergle 2009) – not only have an impact within the Wikipedia web site, but also affect popular search engines. More generally, our Wikipedia findings contribute to a growing discussion (e.g. Hecht 2017; McMahon, Johnson, and Hecht 2017; Lanier 2014; Posner and Weyl 2018; Vincent, Hecht, and Sen

work page 2016
[5]

Our results – along with those of McMahon et al

about the relationships between end users and intelligent technologies like search engines. Our results – along with those of McMahon et al. and others – highlight that end users are not just silent consumers of powerful intelligent technologies. Rather, through the content that they create, end users play an absolutely critical role in helping these tec...

work page 2017
[6]

update location

library, which automates the desktop version of Chrome web browser. In this paper, we focus on desktop search and leave to future work extending our analyses to incorporate the nuances of mobile search (see Discussion below). We make our software available with this paper to allow others to re-purpose and/or replicate our approach1. We note that utiliz-...

work page 2015
[7]

- the first spot may receive up to 30% of all traffic, with the top three spots receiving 60% of all traffic (Insights 2013). Selecting Queries In the search literature – and certainly in the search auditing literature – deciding on a set of queries for an analysis is well-known to be challenging (Pan et al. 2007; McMahon, Johnson, and Hecht 2017; Hannak ...

work page 2013
[8]

trending searches

to select queries in a systematic way. For each of the three dimensions above, we developed two separate categories of queries, leading to six total query categories. Each query category contains between 10-20 queries, a number selected to be practical with respect to the rate limit we imposed to avoid excessive querying. By considering three different...

work page 2018
[9]

Explore

and, as such, we selected these two categories to represent high-revenue queries. To populate these categories with actual queries, we used Google Trends’ “Explore” feature to obtain the top ten queries for “insurance” and for “loans” (in the U.S., from all of 2017). We used Google AdWords’ Keyword Planner to verify that the bids for these query categor...

work page 2017
[10]

ten blue links

Influential Queries: Query popularity and query revenue do not necessarily correlate strongly with the influence of a SERP on people’s lives. Some types of queries – e.g. queries related to a family member’s serious illness or queries related to informing one’s political views – can have an outsized impact (Epstein and Robertson 2015; Soldaini et al. 20...

work page 2015
[11]

blue links

that formally comprised the canonical search results page. Current SERPs contain multiple columns of content, and items like carousels (which have multiple links per row), answer boxes, and more. To understand the prominence of UGC on Google SERPs, it was important that we account for all this complexity. As such, in addition to standard “blue links”, ou...

work page 2010
[12]

creative

A screenshot depicting a selection of elements on Google SERPs. “creative” (i.e. not a copy of some other content) and (2) if the content appeared to be authored outside of professional “routines and practices”. Coders used contextual information such as Twitter biographies or the presence of user reviews to judge whether the content appeared to be “pro-...

work page 2017
[13]

Hecht and Stephens 2014; Mahmud, Nichols, and Drews 2012; Jurgens et al

and computational social science (as well as many other fields) (e.g. Hecht and Stephens 2014; Mahmud, Nichols, and Drews 2012; Jurgens et al. 2015). We discuss possible expansions of this work to different geographic contexts in Future Work. To generate specific geographic coordinates for the strategy outlined above, we used the following approach:

work page 2014
[14]

National Center for Health Statistics (NCHS) (Ingram and Franco 2014), we sampled 10 counties from the most urban and most rural classes

Urban-Rural: Using the urban-rural classifications by the U.S. National Center for Health Statistics (NCHS) (Ingram and Franco 2014), we sampled 10 counties from the most urban and most rural classes. These NCHS classifications are often leveraged in GeoHCI examining rural-urban issues (Colley et al. 2017; Thebault-Spieker, Hecht, and Terveen 2018; John...

work page 2014
[15]

Census American Community Survey 5-Year Estimates (U.S

Income: We selected the top and bottom 10 counties in terms of 2015 median income, according to the 2011-2015 U.S. Census American Community Survey 5-Year Estimates (U.S. Census Bureau 2011), and executed the county-to-coordinate mapping as described above

work page 2015
[16]

Presidential election and again executed the same county-to-coordinate mapping

Voting: We selected the top and bottom 10 counties in terms of percentage of votes for Hillary Clinton in the 2016 U.S. Presidential election and again executed the same county-to-coordinate mapping. This county-level data was published by Townhall (Townhall.com

work page 2016
[17]

and accessed via McGovern's repository (2017). Population-weighted Experiment: As reported below in Results, the rigorous geographic comparisons described above showed little evidence of geographic variation in metrics of interest. As such, it was reasonable to use a single set of query locations to report our results. However, it was non-optimal to sele...

work page 2017
[18]

life insurance,

On the other hand, an example of a query for which Wikipedia is less important is the query “life insurance,” where Wikipedia showed up at rank nine. Beyond Wikipedia, Figure 1 additionally shows that Twitter is also important to Google’s ability to respond to queries in many of our categories. For instance, for most-popular and trending queries, the full...

work page 2018
[19]

Madsbjerg 2017; Posner and Weyl 2018; Porter 2018; Kugler 2018)

and into mainstream debate (e.g. Madsbjerg 2017; Posner and Weyl 2018; Porter 2018; Kugler 2018). This discussion centers on potential asymmetries in the relationship between users and lucrative intelligent technologies: user-generated data is immensely important to such technologies, but many argue that users are not receiving a proportional share of ...

work page 2017
[20]

data labor

have identified information imbalances between intelligent technology owners and data creators as a key mechanism for the current distribution of economic benefits of intelligent technologies. While the developers of intelligent technologies know many such technologies would struggle substantially without constant “data labor” by their users and others...

work page 2017
[21]

After all, most Wikipedia editors benefit heavily from their use of Google, and McMahon et al

have noted, the discussion about the distribution of the technological dividend must also consider the value of the service that intelligent technologies “trade” for data-generating labor. After all, most Wikipedia editors benefit heavily from their use of Google, and McMahon et al. showed that Wikipedia itself does as well (McMahon, Johnson, and Hecht ...

work page 2017
[22]

data strikes

have suggested that collective action by users – e.g. through boycotts, “data strikes”, or data unions – can be one possible solution. Indeed, recent research has highlighted the potential impact that data strikes, boycotts, or combinations thereof could have on intelligent technologies (Vincent, Hecht, and Sen 2019). However, other, less confrontationa...

work page 2019
[23]

Zhu, Kraut, and Kittur 2012; Zhu et al

to the collaboration patterns between editors that lead to the highest-quality content (e.g. Zhu, Kraut, and Kittur 2012; Zhu et al. 2013). Our results further bolster the importance of this literature by showing that the literature’s findings have implications far beyond the boundaries of Wikipedia. For instance, prior work has shown that the English Wik...

work page 2012
[24]

and similar patterns have been observed with respect to Wikipedia’s coverage of some geographic areas versus others (Johnson et al. 2016). Our results highlight that not only do these biases affect reader experience on Wikipedia, they also affect Google’s ability to address information needs associated with the disadvantaged topics. That is, if Wikipedia ...

work page 2016
[25]

– filtering out organizational and other professional accounts will be more difficult and is deserving of further research along the lines of McCorriston et al. (2015). Geographic Personalization and UGC Our geographic comparisons suggest that personalization based on geographic location may be non-substantial for certain types of search phenomena. This m...

work page 2015
[26]

Future studies should address this directly

and that platforms like Twitter are not equally popular in all countries (Schoonderwoerd 2013), geography likely matters across national and linguistic borders. Future studies should address this directly. Limitations As is typical in the search auditing literature, although we aimed to generate queries systemically, the immense number of search engine...

work page 2013
[27]

Knowledge Vault (Dong et al

or structured knowledge domains (e.g. Knowledge Vault (Dong et al. 2014)). Indeed, the introduction of these technologies may be responsible for the decrease we observed in Wikipedia full-page incidence rate for medical queries relative to the work of Laurent and Vickers (2009) last decade (although the methods are not directly comparable). Doing this ...

work page 2014
[28]

www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines

Google Turning Its Lucrative Web Search Over to AI Machines. www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines. Cohen, R.; and Ruths, D

work page 2015
[29]

Insights, Chitika

2013 NCHS Urban-Rural Classification Scheme for Counties. Insights, Chitika

work page 2013
[30]

The War over the Value of Personal Data

“The War over the Value of Personal Data.” Commun. ACM 61 (2): 17–19. https://doi.org/10.1145/3171580. Kulshrestha, J.; Eslami, M.; Messias, J.; Zafar, M. B.; Ghosh, S.; Gummadi, L.; and Karahalios, K

work page doi:10.1145/3171580
[31]

nytimes.com/2017/11/14/business/dealbook/taxing-companies-for-using-our-personal-data.html Mahmud, J.; Nichols, J.; and Drews, C

It’s Time to Tax Companies for Using Our Personal Data. nytimes.com/2017/11/14/business/dealbook/taxing-companies-for-using-our-personal-data.html Mahmud, J.; Nichols, J.; and Drews, C

work page 2017
[32]

mashable.com/2010/02/16/google-wikipedia-donation Pfeil, U.; Zaphiris, P.; and Ang, C

Google Gives $2 Million to Wikipedia’s Foundation. mashable.com/2010/02/16/google-wikipedia-donation Pfeil, U.; Zaphiris, P.; and Ang, C. S

work page 2010
[33]

Shouldn’t You Be Paid for It? https://www.nytimes.com/2018/03/06/business/economy/user-data-pay.html

Your Data Is Crucial to a Robotic Age. Shouldn’t You Be Paid for It? https://www.nytimes.com/2018/03/06/business/economy/user-data-pay.html. Posner, E. A.; and Weyl, E. G

work page 2018
[34]

www.washingtonpost.com/news/the-switch/wp/2018/02/01/google-parent-alphabet-reports-soaring-ad-revenue-despite-youtube-backlash

Google Parent Alphabet Reports Soaring Ad Revenue, despite YouTube Backlash. www.washingtonpost.com/news/the-switch/wp/2018/02/01/google-parent-alphabet-reports-soaring-ad-revenue-despite-youtube-backlash. Shivar, N

work page 2018
[35]

https://townhall.com/election/2016/president/

Election 2016 Results Map. https://townhall.com/election/2016/president/. Townsend, T

work page 2016
[36]

https://factfinder.census.gov

2011-2015 American Community Survey 5-Year Estimates. https://factfinder.census.gov. Van Deursen, A. J. A. M.; and Van Dijk, J. A. G. M

work page 2011

[1] [1]

Moreover, Google makes over $20 bil-lion per year from search advertising revenue (Townsend

and Google.com is the most-visited website in the entire world (Alexa.com 2018). Moreover, Google makes over $20 bil-lion per year from search advertising revenue (Townsend

work page 2018

[2] [2]

and Google’s market capitalization is one of the high-est in the world (Forbes 2018). However, very recent work has suggested that search en-gines, despite their power and profitability, may be surpris-ingly dependent on a resource that is both volunteer-created and freely available: user-generated content (UGC), and specifically Wikipedia. In particular,...

work page 2018

[3] [3]

top three links

Please cite that version instead. improvements to search engine algorithms, for instance the introduction of deep learning (Clark 2015). While McMahon et al. showed that Google search users have a strong preference for Wikipedia pages when they are surfaced, McMahon et al.’s study design did not allow them to ask an equally important question: How often d...

work page 2015

[4] [4]

2016; Hecht and Gergle 2009)– not only have an impact within the Wikipedia web site, but also affect popular search engines

and geographic content biases (Johnson et al. 2016; Hecht and Gergle 2009)– not only have an impact within the Wikipedia web site, but also affect popular search engines. More generally, our Wikipedia findings contribute to a growing discussion (e.g. Hecht 2017; McMahon, Johnson, and Hecht 2017; Lanier 2014; Posner and Weyl 2018; Vin-cent, Hecht, and Sen

work page 2016

[5] [5]

Our results – along with those of McMahon et al

about the relationships between end users and intelligent technologies like search engines. Our results – along with those of McMahon et al. and others – highlight that end users are not just silent consumers of powerful intelligent technologies. Rather, through the con-tent that they create, end users play an absolutely critical role in helping these tec...

work page 2017

[6] [6]

update location

li-brary, which automates the desktop version of Chrome web browser. In this paper, we focus on desktop search and leave to future work extending our analyses to incorporate the nu-ances of mobile search (see Discussion below). We make our software available with this paper to allow others to re-purpose and/or replicate our approach1. We note that utiliz-...

work page 2015

[7] [7]

- the first spot may receive up to 30% of all traffic, with the top three spots receiving 60% of all traffic (Insights 2013). Selecting Queries In the search literature – and certainly in the search auditing literature – deciding on a set of queries for an analysis is well-known to be challenging (Pan et al. 2007; McMahon, Johnson, and Hecht 2017; Hannak ...

work page 2013

[8] [8]

trending searches

to se-lect queries in a systematic way. For each of the three di-mensions above, we developed two separate categories of queries, leading to six total query categories. Each query cat-egory contains between 10-20 queries, a number selected to be practical with respect to the rate limit we imposed to avoid excessive querying. By considering three different...

work page 2018

[9] [9]

Ex-plore

and, as such, we selected these two cate-gories to represent high-revenue queries. To populate these categories with actual queries, we used Google Trends’ “Ex-plore” feature to obtain the top ten queries for “insurance” and for “loans” (in the U.S., from all of 2017). We used Google AdWords’ Keyword Planner to verify that the bids for these query categor...

work page 2017

[10] [10]

ten blue links

Influential Queries: Query popularity and query revenue do not necessarily correlate strongly with the influence of a SERP on people’s lives. Some types of queries – e.g. queries related to a family member’s serious illness or queries re-lated to informing one’s political views – can have an out-sized impact (Epstein and Robertson 2015; Soldaini et al. 20...

work page 2015

[11] [11]

blue links

that for-mally comprised the canonical search results page. Current SERPs contain multiple columns of content, and items like carousels (which have multiple links per row), answer boxes, and more. To understand the prominence of UGC on Google SERPs, it was important that we account for all this complexity. As such, in addition to standard “blue links”, ou...

work page 2010

[12] [12]

creative

A screenshot depicting a selection of elements on Google SERPs. “creative” (i.e. not a copy of some other content) and (2) if the content appeared to be authored outside of professional “routines and practices”. Coders used contextual infor-mation such as Twitter biographies or the presence of user reviews to judge whether the content appeared to be “pro-...

work page 2017

[13] [13]

Hecht and Stephens 2014; Mahmud, Nich-ols, and Drews 2012; Jurgens et al

and computational social science (as well as many other fields) (e.g. Hecht and Stephens 2014; Mahmud, Nich-ols, and Drews 2012; Jurgens et al. 2015). We discuss pos-sible expansions of this work to different geographic con-texts in Future Work. To generate specific geographic coordinates for the strat-egy outlined above, we used the following approach:

work page 2014

[14] [14]

National Center for Health Statistics (NCHS) (In-gram and Franco 2014), we sampled 10 counties from the most urban and most rural classes

Urban-Rural: Using the urban-rural classifications by the U.S. National Center for Health Statistics (NCHS) (In-gram and Franco 2014), we sampled 10 counties from the most urban and most rural classes. These NCHS classifi-cations are often leveraged in GeoHCI examining rural-urban issues (Colley et al. 2017; Thebault-Spieker, Hecht, and Terveen 2018; John...

work page 2014

[15] [15]

Census American Community Survey 5-Year Estimates (U.S

Income: We selected the top and bottom 10 counties in terms of 2015 median income, according to the 2011-2015 U.S. Census American Community Survey 5-Year Estimates (U.S. Census Bureau 2011), and executed the county-to-coordinate mapping as described above

work page 2015

[16] [16]

Presidential election and again executed the same county-to-coordinate mapping

Voting: We selected the top and bottom 10 counties in terms of percentage of votes for Hillary Clinton in the 2016 U.S. Presidential election and again executed the same county-to-coordinate mapping. This county-level data was published by Townhall (Townhall.com

work page 2016

[17] [17]

and accessed via McGovern's repository (2017). Population-weighted Experiment: As reported below in Results, the rigorous geographic comparisons described above showed little evidence of geographic variation in met-rics of interest. As such, it was reasonable to use a single set of query locations to report our results. However, it was non-optimal to sele...

work page 2017

[18]

life insurance,

On the other hand, an example of a query for which Wikipedia is less important is the query “life insurance,” where Wikipedia showed up at rank nine. Beyond Wikipedia, Figure 1 additionally shows that Twitter is also important to Google’s ability to respond to queries in many of our categories. For instance, for most-popular and trending queries, the full...

work page 2018

[19]

Madsbjerg 2017; Posner and Weyl 2018; Porter 2018; Kugler 2018)

and into mainstream debate (e.g. Madsbjerg 2017; Posner and Weyl 2018; Porter 2018; Kugler 2018). This discussion centers on potential asymmetries in the relationship between users and lucrative intelligent technologies: user-generated data is immensely important to such technologies, but many argue that users are not receiving a proportional share of ...

work page 2017

[20]

data labor

have identified information imbalances between intelligent technology owners and data creators as a key mechanism for the current distribution of economic benefits of intelligent technologies. While the developers of intelligent technologies know many such technologies would struggle substantially without constant “data labor” by their users and others...

work page 2017

[21]

After all, most Wikipedia editors benefit heavily from their use of Google, and McMahon et al

have noted, the discussion about the distribution of the technological dividend must also consider the value of the service that intelligent technologies “trade” for data-generating labor. After all, most Wikipedia editors benefit heavily from their use of Google, and McMahon et al. showed that Wikipedia itself does as well (McMahon, Johnson, and Hecht ...

work page 2017

[22]

data strikes

have suggested that collective action by users – e.g. through boycotts, “data strikes”, or data unions – can be one possible solution. Indeed, recent research has highlighted the potential impact that data strikes, boycotts, or combinations thereof could have on intelligent technologies (Vincent, Hecht, and Sen 2019). However, other, less confrontationa...

work page 2019

[23]

Zhu, Kraut, and Kittur 2012; Zhu et al

to the collaboration patterns between editors that lead to the highest-quality content (e.g. Zhu, Kraut, and Kittur 2012; Zhu et al. 2013). Our results further bolster the importance of this literature by showing that the literature’s findings have implications far beyond the boundaries of Wikipedia. For instance, prior work has shown that the English Wik...

work page 2012

[24]

and similar patterns have been observed with respect to Wikipedia’s coverage of some geographic areas versus others (Johnson et al. 2016). Our results highlight that not only do these biases affect reader experience on Wikipedia, they also affect Google’s ability to address information needs associated with the disadvantaged topics. That is, if Wikipedia ...

work page 2016

[25]

– filtering out organizational and other professional accounts will be more difficult and is deserving of further research along the lines of McCorriston et al. (2015). Geographic Personalization and UGC Our geographic comparisons suggest that personalization based on geographic location may be non-substantial for certain types of search phenomena. This m...

work page 2015

[26]

Future studies should address this directly

and that platforms like Twitter are not equally popular in all countries (Schoonderwoerd 2013), geography likely matters across national and linguistic borders. Future studies should address this directly. Limitations As is typical in the search auditing literature, although we aimed to generate queries systemically, the immense number of search engine...

work page 2013

[27]

Knowledge Vault (Dong et al

or structured knowledge domains (e.g. Knowledge Vault (Dong et al. 2014)). Indeed, the introduction of these technologies may be responsible for the decrease we observed in Wikipedia full-page incidence rate for medical queries relative to the work of Laurent and Vickers (2009) last decade (although the methods are not directly comparable). Doing this ...

work page 2014

[28]

www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines

Google Turning Its Lucrative Web Search Over to AI Machines. www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines. Cohen, R.; and Ruths, D

work page 2015

[29]

Insights, Chitika

2013 NCHS Urban-Rural Classification Scheme for Counties. Insights, Chitika

work page 2013

[30]

The War over the Value of Personal Data

“The War over the Value of Personal Data.” Commun. ACM 61 (2): 17–19. https://doi.org/10.1145/3171580. Kulshrestha, J.; Eslami, M.; Messias, J.; Zafar, M. B.; Ghosh, S.; Gummadi, L.; and Karahalios, K

work page doi:10.1145/3171580

[31]

nytimes.com/2017/11/14/business/dealbook/taxing-companies-for-using-our-personal-data.html Mahmud, J.; Nichols, J.; and Drews, C

It’s Time to Tax Companies for Using Our Personal Data. nytimes.com/2017/11/14/business/dealbook/taxing-companies-for-using-our-personal-data.html Mahmud, J.; Nichols, J.; and Drews, C

work page 2017

[32]

mashable.com/2010/02/16/google-wikipedia-donation Pfeil, U.; Zaphiris, P.; and Ang, C

Google Gives $2 Million to Wikipedia’s Foundation. mashable.com/2010/02/16/google-wikipedia-donation Pfeil, U.; Zaphiris, P.; and Ang, C. S

work page 2010

[33]

Shouldn’t You Be Paid for It? https://www.nytimes.com/2018/03/06/business/economy/user-data-pay.html

Your Data Is Crucial to a Robotic Age. Shouldn’t You Be Paid for It? https://www.nytimes.com/2018/03/06/business/economy/user-data-pay.html. Posner, E. A.; and Weyl, E. G

work page 2018

[34]

www.washingtonpost.com/news/the-switch/wp/2018/02/01/google-parent-alphabet-reports-soaring-ad-revenue-despite-youtube-backlash

Google Parent Alphabet Reports Soaring Ad Revenue, despite YouTube Backlash. www.washingtonpost.com/news/the-switch/wp/2018/02/01/google-parent-alphabet-reports-soaring-ad-revenue-despite-youtube-backlash. Shivar, N

work page 2018

[35]

https://townhall.com/election/2016/president/

Election 2016 Results Map. https://townhall.com/election/2016/president/. Townsend, T

work page 2016

[36]

https://factfinder.census.gov

2011-2015 American Community Survey 5-Year Estimates. https://factfinder.census.gov. Van Deursen, A. J. A. M.; and Van Dijk, J. A. G. M

work page 2011