arxiv: 2607.00403 · v1 · pith:UWFHPG6Inew · submitted 2026-07-01 · 💻 cs.HC · cs.CR · cs.CY

A Penny for Your Prompts: Experiments Detecting and Mitigating LLM Usage by Survey Respondents

Pith reviewed 2026-07-02 06:56 UTC · model grok-4.3

classification 💻 cs.HC · cs.CR · cs.CY
keywords LLM detection · survey responses · crowdsourcing platforms · Mechanical Turk · Prolific · data quality · mitigation · keystroke analysis

The pith

LLM-assisted answers make up under 10% of survey responses on Prolific but over 80% on Mechanical Turk, and can be detected through response characteristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how frequently survey respondents on crowdsourcing platforms rely on large language models to generate answers. It identifies distinct characteristics that mark LLM-assisted replies and measures how often they occur on different platforms. The authors varied platform and survey length, added requests not to use AI, and disabled copy-paste. The mitigation measures lowered the rate of LLM use but did not reliably raise the overall quality of the collected data. The authors conclude that researchers should record and analyze keystrokes and craft instructions and questions aimed at catching LLM involvement.

Core claim

Experiments with 250 participants showed that LLM-assisted survey responses exhibit distinct characteristics that allow detection, with prevalence ranging from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures such as requests not to use AI and disabling copy-paste reduced LLM usage but did not necessarily improve data quality. No participants used browser-use agents at the time of the survey, but the authors report their own detection experiments. They recommend keystroke recording plus targeted instructions and questions to screen for LLM assistance.

What carries the argument

Detection of LLM assistance through distinct response characteristics, tested via controlled variations in platform, survey length, anti-AI instructions, and copy-paste restrictions.

If this is right

  • Platform choice strongly affects the baseline rate of LLM-assisted survey responses.
  • Anti-AI instructions and copy-paste blocks lower the frequency of LLM use.
  • Lower LLM usage does not automatically produce higher-quality survey data.
  • Keystroke data provides a practical way to screen responses for LLM involvement.
  • Carefully worded instructions and questions can help both detect and discourage AI use.
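
The keystroke-screening idea in the list above can be illustrated with a minimal sketch. It assumes a hypothetical event log in which each entry records whether text arrived by a single key press or by a paste, and it flags answers where only a small share of the final text was typed. The `InputEvent` schema, the function names, and the 0.5 threshold are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    """One logged input event. Hypothetical schema, not taken from the paper."""
    kind: str   # "key" for a single typed character, "paste" for a paste action
    chars: int  # number of characters this event added to the answer


def typed_fraction(events: list[InputEvent], final_length: int) -> float:
    """Return the share of the final answer that was typed key by key."""
    if final_length <= 0:
        return 0.0
    typed = sum(e.chars for e in events if e.kind == "key")
    # Deleted-and-retyped text can push the ratio above 1, so cap it.
    return min(typed / final_length, 1.0)


def flag_for_review(events: list[InputEvent], final_length: int,
                    threshold: float = 0.5) -> bool:
    """Flag an answer when less than `threshold` of it was typed (assumed cutoff)."""
    return typed_fraction(events, final_length) < threshold
```

A flag from a check like this is best read as a prompt for manual review rather than proof of LLM use, since a respondent could retype generated text or paste their own notes.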

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing online studies may contain undetected LLM-generated data whose impact on conclusions remains unknown.
  • Recruitment platforms could add built-in keystroke logging as a standard feature for researchers.
  • The same detection approach might extend to other online data sources such as interviews or open-ended feedback forms.

Load-bearing premise

The observed response characteristics are caused by LLM assistance rather than fatigue, language differences, or platform-specific user pools.

What would settle it

A side-by-side test of known human-written answers and known LLM-generated answers on the same survey questions to check whether the reported distinguishing characteristics separate the two groups reliably.
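
Such a side-by-side test could be scored as in the sketch below, which assumes a labeled set where the true origin of each answer is known and a detector has produced a yes/no flag for each one. The function name and the input layout are hypothetical and serve only to show how sensitivity and specificity would be computed.

```python
def evaluate_detector(is_llm: list[bool], flagged: list[bool]) -> dict[str, float]:
    """Compare detector flags against known labels.

    is_llm[i]  -- True if answer i is known to be LLM-generated
    flagged[i] -- True if the detector marked answer i as LLM-assisted
    """
    if len(is_llm) != len(flagged):
        raise ValueError("labels and flags must have the same length")

    pairs = list(zip(is_llm, flagged))
    tp = sum(1 for y, f in pairs if y and f)          # LLM answers caught
    fn = sum(1 for y, f in pairs if y and not f)      # LLM answers missed
    tn = sum(1 for y, f in pairs if not y and not f)  # human answers passed
    fp = sum(1 for y, f in pairs if not y and f)      # human answers wrongly flagged

    # Undefined rates (no examples of a class) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity}
```

High sensitivity with low specificity would suggest that the distinguishing characteristics also appear in genuine human answers, which is the confound the load-bearing premise depends on ruling out.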

Original abstract

Large language models are increasingly used by participants on crowdsourcing platforms when responding to surveys, potentially undermining the validity of collected data. Our study aims to quantify the prevalence of this behavior and investigate methods to detect and prevent it. In a series of surveys (N = 250), we examined conditions such as platform choice, survey length, requests not to use AI, and disabling copy-paste functionality. We were able to identify distinct characteristics of LLM-assisted responses and found that their frequency varied widely, from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures reduced LLM usage but did not necessarily improve data quality. No participants employed browser-use agents at the time of our survey, but we report on our own detection experiments. We recommend that researchers actively screen survey responses for LLM usage by recording and analyzing keystroke data and crafting instructions and questions aimed at AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a series of surveys (N=250) conducted on crowdsourcing platforms to quantify the prevalence of LLM-assisted responses, identify distinguishing characteristics of such responses, and test mitigation strategies including platform choice, survey length, explicit instructions against AI use, and disabling copy-paste. Key findings include platform-level prevalence differences (under 10% on Prolific vs. over 80% on Mechanical Turk), reduction in LLM usage from mitigations without consistent gains in data quality, absence of browser-agent use by participants, and a recommendation to screen responses via keystroke logging and targeted question design.

Significance. If the detection criteria and classification procedures prove reliable after clarification, the work would be significant for survey methodology and crowdsourced data collection. It provides empirical evidence of platform-specific differences in LLM usage and practical mitigation tests, directly relevant to researchers relying on platforms like MTurk and Prolific. The emphasis on keystroke data as a detection tool offers a concrete, implementable suggestion.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.
  2. [Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.
  3. [Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.
minor comments (2)
  1. [Abstract] The abstract states N=250 but does not break down the allocation across the multiple conditions (platform, length, instructions, copy-paste); a table or explicit counts would improve transparency.
  2. [Abstract] The statement that 'no participants employed browser-use agents' is presented without describing how this was verified or the scope of the detection experiments mentioned.
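
The kind of statistical support requested in major comment 2 could look like the sketch below: a Wilson score interval for each platform's prevalence and a two-sided Fisher exact test for the difference. It uses only the Python standard library. The function names are illustrative, and any counts passed in are placeholders, since the per-platform sample sizes are not reported in the material reviewed here.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are platforms; columns are LLM-assisted vs. not LLM-assisted.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(x: int) -> float:
        # Hypergeometric probability of x LLM-assisted answers in row 1.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

For example, with placeholder counts of 5 flagged responses out of 50 on one platform and 40 out of 50 on another, `wilson_interval(5, 50)` and `wilson_interval(40, 50)` give the per-platform intervals, and `fisher_exact_two_sided(5, 45, 40, 10)` tests the difference.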

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and commit to revisions where needed.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.

    Authors: We acknowledge that the submitted manuscript does not provide sufficient detail on the classification criteria. In the revision, we will expand the Methods section with a dedicated subsection explicitly listing the features and rules used to identify LLM-assisted responses, describing our ground-truth validation approach from controlled experiments, reporting any inter-rater reliability, and addressing potential confounds including participant fatigue and language background.
    revision: yes

  2. Referee: [Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.

    Authors: We agree that the Results section lacks the necessary statistical support. We will revise it to include appropriate statistical tests for the reported platform differences, add confidence intervals and error bars, and provide full sample-size breakdowns by platform and condition.
    revision: yes

  3. Referee: [Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.

    Authors: This point is well taken. We will revise the manuscript to explicitly define and operationalize 'data quality' using specific metrics, and report pre- and post-mitigation comparisons on those metrics to substantiate the observed dissociation.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical study

full rationale

This paper reports an empirical measurement study based on surveys (N=250) across platforms, testing conditions like platform choice, survey length, and mitigation measures. No derivations, equations, fitted parameters, or first-principles results are present that could reduce outputs to inputs by construction. Claims about LLM-response characteristics and prevalence differences rest on direct experimental observations and keystroke analysis rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The work is self-contained as an experimental report with no mathematical chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms; relies on standard assumptions about survey response behavior and observable text features.

axioms (1)
  • domain assumption: Observable text and interaction features can distinguish LLM-generated from human responses under the tested conditions.
    Central to the identification of distinct characteristics reported in the abstract.

pith-pipeline@v0.9.1-grok · 5685 in / 1119 out tokens · 28514 ms · 2026-07-02T06:56:49.383143+00:00 · methodology
