A Penny for Your Prompts: Experiments Detecting and Mitigating LLM Usage by Survey Respondents

Nathan Malkin; Zane Xu

arxiv: 2607.00403 · v1 · pith:UWFHPG6Inew · submitted 2026-07-01 · 💻 cs.HC · cs.CR · cs.CY


Pith reviewed 2026-07-02 06:56 UTC · model grok-4.3

classification 💻 cs.HC · cs.CR · cs.CY

keywords LLM detection · survey responses · crowdsourcing platforms · Mechanical Turk · Prolific · data quality · mitigation · keystroke analysis


The pith

LLM-assisted answers make up under 10% of survey responses on Prolific but over 80% on Mechanical Turk, and they can be detected through their response characteristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how frequently survey respondents on crowdsourcing platforms rely on large language models to generate answers. It identifies distinct characteristics that mark LLM-assisted replies and measures how often they occur on different platforms. The surveys varied platform, survey length, requests not to use AI, and copy-paste blocking; the mitigation measures lowered the rate of LLM use. These steps did not reliably raise the overall quality of the collected data. The authors conclude that researchers should record and analyze keystrokes and design instructions and questions to catch LLM involvement.

Core claim

Experiments with 250 participants showed that LLM-assisted survey responses have distinct characteristics that allow detection, with prevalence ranging from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures such as requests not to use AI and disabling copy-paste reduced LLM usage but did not necessarily improve data quality. No participants used browser-use agents at the time of the survey, so the authors report their own detection experiments for that case. They recommend recording and analyzing keystrokes, together with targeted instructions and questions, to screen for LLM assistance.

What carries the argument

Detection of LLM assistance through distinct response characteristics, tested via controlled variations in platform, survey length, anti-AI instructions, and copy-paste restrictions.

If this is right

Platform choice strongly affects the baseline rate of LLM-assisted survey responses.
Anti-AI instructions and copy-paste blocks lower the frequency of LLM use.
Lower LLM usage does not automatically produce higher-quality survey data.
Keystroke data provides a practical way to screen responses for LLM involvement.
Carefully worded instructions and questions can help both detect and discourage AI use.
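
As a rough illustration of the keystroke screening idea in the list above, the following sketch flags a free-text answer when a paste event was logged or when the logged key presses account for too little of the submitted text. The event format, the `KeyEvent` record, the `screen_response` function, and the 0.5 threshold are all assumptions made for this example; the paper's actual detection criteria are not described on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyEvent:
    """One logged input event (hypothetical format, not the paper's)."""
    kind: str        # "key" for a printable key press, "paste" for a paste event
    chars: int = 1   # number of characters the event contributed


def screen_response(final_text: str, events: List[KeyEvent],
                    min_typed_ratio: float = 0.5) -> dict:
    """Flag a free-text answer that was probably not typed by hand.

    min_typed_ratio is an arbitrary illustrative threshold: the share of
    the final text that must be accounted for by individual key presses.
    """
    typed = sum(e.chars for e in events if e.kind == "key")
    pasted = sum(e.chars for e in events if e.kind == "paste")
    length = len(final_text)
    # An empty answer has nothing to explain, so treat it as fully typed.
    typed_ratio = typed / length if length else 1.0
    return {
        "typed_chars": typed,
        "pasted_chars": pasted,
        "typed_ratio": typed_ratio,
        "flag": pasted > 0 or typed_ratio < min_typed_ratio,
    }


if __name__ == "__main__":
    answer = "I mostly use my phone for banking and messaging."
    hand_typed = [KeyEvent("key") for _ in answer]
    pasted_in = [KeyEvent("paste", chars=len(answer))]
    print(screen_response(answer, hand_typed)["flag"])
    print(screen_response(answer, pasted_in)["flag"])
```

The typed-ratio check is included because blocking copy-paste does not stop text from being inserted by other means; a response whose length is not explained by key presses is suspicious even when no paste event was recorded.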

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many existing online studies may contain undetected LLM-generated data whose impact on conclusions remains unknown.
Recruitment platforms could add built-in keystroke logging as a standard feature for researchers.
The same detection approach might extend to other online data sources such as interviews or open-ended feedback forms.

Load-bearing premise

The observed response characteristics are caused by LLM assistance rather than fatigue, language differences, or platform-specific user pools.

What would settle it

A side-by-side test of known human-written answers and known LLM-generated answers on the same survey questions to check whether the reported distinguishing characteristics separate the two groups reliably.
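
The side-by-side test described above amounts to scoring a detector against answers of known origin. A minimal sketch, assuming a hypothetical `detector` callable that returns True for text it judges to be LLM-generated; the toy word-count detector and the sample answers are invented for illustration and are not the paper's criteria or data.

```python
from typing import Callable, Iterable


def evaluate_detector(detector: Callable[[str], bool],
                      human_answers: Iterable[str],
                      llm_answers: Iterable[str]) -> dict:
    """Score a detector on answers whose origin is known."""
    human = list(human_answers)
    llm = list(llm_answers)
    true_pos = sum(1 for a in llm if detector(a))     # LLM text caught
    false_pos = sum(1 for a in human if detector(a))  # human text wrongly flagged
    nan = float("nan")
    return {
        # Share of known LLM answers that the detector flags.
        "sensitivity": true_pos / len(llm) if llm else nan,
        # Share of known human answers that the detector leaves alone.
        "specificity": (len(human) - false_pos) / len(human) if human else nan,
        "true_pos": true_pos,
        "false_pos": false_pos,
    }


if __name__ == "__main__":
    # Toy detector (an assumption): flag answers longer than 20 words.
    def long_answer(text: str) -> bool:
        return len(text.split()) > 20

    known_human = ["idk, mostly email", "I use it for work stuff and games"]
    known_llm = [
        "As a frequent user of technology, I rely on my smartphone for a "
        "wide variety of daily tasks, including communication, navigation, "
        "and entertainment."
    ]
    print(evaluate_detector(long_answer, known_human, known_llm))
```

If the reported characteristics separate the two groups reliably, both sensitivity and specificity should be high; a high false-positive count on human answers would point to confounds such as answer length.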


Large language models are increasingly used by participants on crowdsourcing platforms when responding to surveys, potentially undermining the validity of collected data. Our study aims to quantify the prevalence of this behavior and investigate methods to detect and prevent it. In a series of surveys (N = 250), we examined conditions such as platform choice, survey length, requests not to use AI, and disabling copy-paste functionality. We were able to identify distinct characteristics of LLM-assisted responses and found that their frequency varied widely, from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures reduced LLM usage but did not necessarily improve data quality. No participants employed browser-use agents at the time of our survey, but we report on our own detection experiments. We recommend that researchers actively screen survey responses for LLM usage by recording and analyzing keystroke data and crafting instructions and questions aimed at AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives platform-specific rates of LLM use in surveys and tests simple mitigations, but the detection approach is underspecified.


The core finding is that LLM-assisted responses show up at very different rates depending on the platform—under 10% on Prolific versus over 80% on Mechanical Turk—and that turning off copy-paste or adding explicit instructions cuts the rate but does not always raise data quality. Those numbers and the mitigation tests are the new pieces.

The work is straightforward empirical measurement. It takes a real problem that affects anyone running online studies and supplies concrete prevalence estimates plus quick checks on two common fixes. That is useful for the HCI and social-science crowd that relies on these platforms.

The main gap is in how they actually spotted the LLM responses. The abstract says they identified distinct characteristics, but it gives no criteria, no validation against known LLM output, no inter-rater checks, and no statistical tests or error bars. Without those details it is difficult to judge whether the reported rates reflect LLM use or other factors such as answer length or participant pool differences. The N=250 is also small once split across conditions.

This is the kind of paper that belongs in a methods or applied HCI venue. A serious referee should see it because the question matters and the platform contrast is worth publishing, but the methods section will need substantial expansion before it can be trusted. I would bring it to a reading group for the numbers alone and would cite the prevalence claims once the detection pipeline is clearer.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a series of surveys (N=250) conducted on crowdsourcing platforms to quantify the prevalence of LLM-assisted responses, identify distinguishing characteristics of such responses, and test mitigation strategies including platform choice, survey length, explicit instructions against AI use, and disabling copy-paste. Key findings include platform-level prevalence differences (under 10% on Prolific vs. over 80% on Mechanical Turk), reduction in LLM usage from mitigations without consistent gains in data quality, absence of browser-agent use by participants, and a recommendation to screen responses via keystroke logging and targeted question design.

Significance. If the detection criteria and classification procedures prove reliable after clarification, the work would be significant for survey methodology and crowdsourced data collection. It provides empirical evidence of platform-specific differences in LLM usage and practical mitigation tests, directly relevant to researchers relying on platforms like MTurk and Prolific. The emphasis on keystroke data as a detection tool offers a concrete, implementable suggestion.

major comments (3)

[Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.
[Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.
[Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.
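
To illustrate the kind of uncertainty reporting the second major comment asks for, here is a minimal sketch of a Wilson score interval for a proportion. The counts below are hypothetical placeholders chosen only to fall near the reported ranges; the per-platform sample sizes are not given in the abstract, and this is not necessarily the method the authors use.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    # Hypothetical counts of LLM-assisted responses, NOT the paper's data.
    hypothetical = {"Prolific": (4, 50), "Mechanical Turk": (42, 50)}
    for platform, (flagged, total) in hypothetical.items():
        low, high = wilson_interval(flagged, total)
        print(f"{platform}: {flagged}/{total} flagged, 95% CI [{low:.2f}, {high:.2f}]")
```

The Wilson interval is used here because it stays within [0, 1] and behaves reasonably for small samples and proportions near 0 or 1, which is the situation once N=250 is split across conditions.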

minor comments (2)

[Abstract] The abstract states N=250 but does not break down the allocation across the multiple conditions (platform, length, instructions, copy-paste); a table or explicit counts would improve transparency.
[Abstract] The statement that 'no participants employed browser-use agents' is presented without describing how this was verified or the scope of the detection experiments mentioned.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and commit to revisions where needed.


Referee: [Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.

Authors: We acknowledge that the submitted manuscript does not provide sufficient detail on the classification criteria. In the revision, we will expand the Methods section with a dedicated subsection explicitly listing the features and rules used to identify LLM-assisted responses, describing our ground-truth validation approach from controlled experiments, reporting any inter-rater reliability, and addressing potential confounds including participant fatigue and language background. revision: yes
Referee: [Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.

Authors: We agree that the Results section lacks the necessary statistical support. We will revise it to include appropriate statistical tests for the reported platform differences, add confidence intervals and error bars, and provide full sample-size breakdowns by platform and condition. revision: yes
Referee: [Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.

Authors: This point is well taken. We will revise the manuscript to explicitly define and operationalize 'data quality' using specific metrics, and report pre- and post-mitigation comparisons on those metrics to substantiate the observed dissociation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical study


This paper reports an empirical measurement study based on surveys (N=250) across platforms, testing conditions like platform choice, survey length, and mitigation measures. No derivations, equations, fitted parameters, or first-principles results are present that could reduce outputs to inputs by construction. Claims about LLM-response characteristics and prevalence differences rest on direct experimental observations and keystroke analysis rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The work is self-contained as an experimental report with no mathematical chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms; relies on standard assumptions about survey response behavior and observable text features.

axioms (1)

domain assumption: Observable text and interaction features can distinguish LLM-generated from human responses under the tested conditions
Central to the identification of distinct characteristics reported in the abstract.




2023
[71]

Repli- cation: How Well Do My Results Generalize Now? The External Validity of Online Privacy and Security Surveys

Jenny Tang, Eleanor Birrell, and Ada Lerner. Repli- cation: How Well Do My Results Generalize Now? The External Validity of Online Privacy and Security Surveys. InEighteenth Symposium on Usable Pri- vacy and Security (SOUPS 2022), pages 367–385, 2022. https://www.usenix.org/conference/soups202 2/presentation/tang

2022
[72]

The Science of Detecting LLM-Generated Text.Commun

Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. The Science of Detecting LLM-Generated Text.Commun. ACM, 67(4):50–59, March 2024. https://dl.acm.o rg/doi/10.1145/3624725

work page doi:10.1145/3624725 2024
[73]

Do LLMs Exhibit Human-like Response Biases? A Case Study in Sur- vey Design.Transactions of the Association for Com- putational Linguistics, 12:1011–1026, September 2024

Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkwar, and Graham Neubig. Do LLMs Exhibit Human-like Response Biases? A Case Study in Sur- vey Design.Transactions of the Association for Com- putational Linguistics, 12:1011–1026, September 2024. https://doi.org/10.1162/tacl_a_00685

work page doi:10.1162/tacl_a_00685 2024
[74]

The threat of AI chatbot responses to crowdsourced open-ended survey questions.Energy Research & Social Science, 119:103857, January 2025

Frederic Traylor. The threat of AI chatbot responses to crowdsourced open-ended survey questions.Energy Research & Social Science, 119:103857, January 2025. https://linkinghub.elsevier.com/retrieve/p ii/S2214629624004481

2025
[75]

Cozzolino, Andrew Gordon, David Rothschild, and Robert West

Veniamin Veselovsky, Manoel Horta Ribeiro, Philip J. Cozzolino, Andrew Gordon, David Rothschild, and Robert West. Prevalence and Prevention of Large Lan- guage Model Use in Crowd Work.Commun. ACM, 68(3):42–47, February 2025. https://dl.acm.org/d oi/10.1145/3685527

work page doi:10.1145/3685527 2025
[76]

Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks

Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West. Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks. http://arxiv.org/abs/2306.0 7899, June 2023. arXiv:2306.07899

work page arXiv 2023
[77]

Kumar, and Jason Pridmore

Jessica Vitak, Yuting Liao, Anouk Mols, Daniel Trottier, Michael Zimmer, Priya C. Kumar, and Jason Pridmore. When Do Data Collection and Use Become a Matter of Concern? A Cross-Cultural Comparison of U.S. and Dutch Privacy Attitudes.International Journal of Com- munication, 17(0):28, 2023. https://ijoc.org/ind ex.php/ijoc/article/view/19391

2023
[78]

Webb and June P

Margaret A. Webb and June P. Tangney. Too Good to Be True: Bots and Bad Data From Mechanical Turk. Perspectives on Psychological Science, 19(6):887–890, November 2024. https://doi.org/10.1177/1745 6916221120027

work page doi:10.1177/1745 2024
[79]

Westwood

Sean J. Westwood. The potential existential threat of large language models to online survey research. Proceedings of the National Academy of Sciences, 122(47):e2518075122, November 2025. https://www. pnas.org/doi/full/10.1073/pnas.2518075122

work page doi:10.1073/pnas.2518075122 2025
[80]

Wong, Shu Yang, Xinyi Yang, Yulin Yuan, and Lidia S

Junchao Wu, Runzhe Zhan, Derek F. Wong, Shu Yang, Xinyi Yang, Yulin Yuan, and Lidia S. Chao. Detec- tRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios.Advances in Neural Informa- tion Processing Systems, 37:100369–100401, December 2024

2024
