A Penny for Your Prompts: Experiments Detecting and Mitigating LLM Usage by Survey Respondents
Pith reviewed 2026-07-02 06:56 UTC · model grok-4.3
The pith
LLM-assisted responses appear in under 10% of Prolific surveys but over 80% on Mechanical Turk, and can be detected through response characteristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a series of surveys with 250 participants, LLM-assisted responses showed distinct characteristics that allow detection, and their prevalence ranged from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures such as requests not to use AI and disabling copy-paste reduced LLM usage but did not necessarily improve data quality. No participants used browser-use agents at the time of the survey, so the authors report their own experiments on detecting such agents instead. The study recommends screening for LLM assistance by recording and analyzing keystroke data and by crafting instructions and questions aimed at AI.
What carries the argument
Detection of LLM assistance through distinct response characteristics, tested via controlled variations in platform, survey length, anti-AI instructions, and copy-paste restrictions.
If this is right
- Platform choice strongly affects the baseline rate of LLM-assisted survey responses.
- Anti-AI instructions and copy-paste blocks lower the frequency of LLM use.
- Lower LLM usage does not automatically produce higher-quality survey data.
- Keystroke data provides a practical way to screen responses for LLM involvement.
- Carefully worded instructions and questions can help both detect and discourage AI use.
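As an illustration of the keystroke-based screening mentioned in the list above, the following is a minimal sketch and not the authors' detection rule. It assumes a hypothetical per-answer log with a count of character-producing key events and a count of paste events; the field names and the `min_typed_ratio` threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ResponseLog:
    """Hypothetical per-answer record; all field names are assumptions."""
    final_text: str       # the answer as submitted
    keystroke_count: int  # character-producing key events recorded in the field
    paste_events: int     # paste events recorded in the field


def flag_for_review(log: ResponseLog, min_typed_ratio: float = 0.5) -> bool:
    """Return True when an answer looks inserted rather than typed.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if log.paste_events > 0:
        return True
    if not log.final_text:
        # Nothing was submitted, so there is nothing to screen.
        return False
    # Far fewer keystrokes than characters suggests text arrived by other means.
    typed_ratio = log.keystroke_count / len(log.final_text)
    return typed_ratio < min_typed_ratio
```

A flag here only marks a response for manual review. Autofill, dictation, or accessibility tools could also produce few keystrokes, so the flag alone does not establish LLM use.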
Where Pith is reading between the lines
- Many existing online studies may contain undetected LLM-generated data whose impact on conclusions remains unknown.
- Recruitment platforms could add built-in keystroke logging as a standard feature for researchers.
- The same detection approach might extend to other online data sources such as interviews or open-ended feedback forms.
Load-bearing premise
The observed response characteristics are caused by LLM assistance rather than fatigue, language differences, or platform-specific user pools.
What would settle it
A side-by-side test of known human-written answers and known LLM-generated answers on the same survey questions to check whether the reported distinguishing characteristics separate the two groups reliably.
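Such a side-by-side test could be scored as sketched below, assuming each answer carries a known ground-truth label and a detector flag. The function name and input layout are hypothetical, and no data from the paper is used.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    is_llm: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Score a detector against known labels.

    is_llm[i]  : True if answer i is known to be LLM-generated.
    flagged[i] : True if the detector flagged answer i.
    Returns (sensitivity, specificity); a value is NaN when its class is empty.
    """
    if len(is_llm) != len(flagged):
        raise ValueError("label and flag sequences must have equal length")
    pairs = list(zip(is_llm, flagged))
    tp = sum(1 for truth, flag in pairs if truth and flag)
    fn = sum(1 for truth, flag in pairs if truth and not flag)
    tn = sum(1 for truth, flag in pairs if not truth and not flag)
    fp = sum(1 for truth, flag in pairs if not truth and flag)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

High values on both measures would indicate that the distinguishing characteristics separate the two groups; a low specificity would mean many human answers are wrongly flagged.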
Large language models are increasingly used by participants on crowdsourcing platforms when responding to surveys, potentially undermining the validity of collected data. Our study aims to quantify the prevalence of this behavior and investigate methods to detect and prevent it. In a series of surveys (N = 250), we examined conditions such as platform choice, survey length, requests not to use AI, and disabling copy-paste functionality. We were able to identify distinct characteristics of LLM-assisted responses and found that their frequency varied widely, from under 10% on Prolific to over 80% on Mechanical Turk. Mitigation measures reduced LLM usage but did not necessarily improve data quality. No participants employed browser-use agents at the time of our survey, but we report on our own detection experiments. We recommend that researchers actively screen survey responses for LLM usage by recording and analyzing keystroke data and crafting instructions and questions aimed at AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from a series of surveys (N=250) conducted on crowdsourcing platforms to quantify the prevalence of LLM-assisted responses, identify distinguishing characteristics of such responses, and test mitigation strategies including platform choice, survey length, explicit instructions against AI use, and disabling copy-paste. Key findings include platform-level prevalence differences (under 10% on Prolific vs. over 80% on Mechanical Turk), reduction in LLM usage from mitigations without consistent gains in data quality, absence of browser-agent use by participants, and a recommendation to screen responses via keystroke logging and targeted question design.
Significance. If the detection criteria and classification procedures prove reliable after clarification, the work would be significant for survey methodology and crowdsourced data collection. It provides empirical evidence of platform-specific differences in LLM usage and practical mitigation tests, directly relevant to researchers relying on platforms like MTurk and Prolific. The emphasis on keystroke data as a detection tool offers a concrete, implementable suggestion.
major comments (3)
- [Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.
- [Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.
- [Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.
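The confidence intervals requested for the prevalence estimates could be computed with a Wilson score interval, sketched below. This is one common choice for binomial proportions and not necessarily the method the authors would adopt. The paper reports no per-platform counts, so any counts passed in are the caller's own.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes : number of responses classified as LLM-assisted
    n         : number of responses in the platform or condition
    z         : normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small samples and for proportions near 0 or 1, which matters when a condition has few respondents.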
minor comments (2)
- [Abstract] The abstract states N=250 but does not break down the allocation across the multiple conditions (platform, length, instructions, copy-paste); a table or explicit counts would improve transparency.
- [Abstract] The statement that 'no participants employed browser-use agents' is presented without describing how this was verified or the scope of the detection experiments mentioned.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment below and commit to revisions where needed.
- Referee: [Abstract / Methods] Abstract and Methods (inferred from reported N=250 and frequency claims): the exact criteria, features, or classification rules used to label responses as LLM-assisted are not specified, nor is any ground-truth validation, inter-rater reliability, or control for confounds such as participant fatigue or language background. This is load-bearing for the central prevalence estimates and platform comparisons.
Authors: We acknowledge that the submitted manuscript does not provide sufficient detail on the classification criteria. In the revision, we will expand the Methods section with a dedicated subsection explicitly listing the features and rules used to identify LLM-assisted responses, describing our ground-truth validation approach from controlled experiments, reporting any inter-rater reliability, and addressing potential confounds including participant fatigue and language background. revision: yes
- Referee: [Results] Results (frequency differences): no statistical tests, confidence intervals, error bars, or sample-size breakdowns per platform/condition are reported despite claims of clear differences (e.g., <10% vs. >80%). Without these, the robustness of the platform effect cannot be assessed.
Authors: We agree that the Results section lacks the necessary statistical support. We will revise it to include appropriate statistical tests for the reported platform differences, add confidence intervals and error bars, and provide full sample-size breakdowns by platform and condition. revision: yes
- Referee: [Results / Discussion] Mitigation experiments: the claim that mitigations 'reduced LLM usage but did not necessarily improve data quality' requires explicit operationalization of 'data quality' and pre/post metrics; none are described, leaving the dissociation between usage reduction and quality untestable.
Authors: This point is well taken. We will revise the manuscript to explicitly define and operationalize 'data quality' using specific metrics, and report pre- and post-mitigation comparisons on those metrics to substantiate the observed dissociation. revision: yes
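The inter-rater reliability discussed in the first response could be reported with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa for two raters as an illustrative stand-in; the source does not say which statistic the authors use, and the function and argument names are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same number of items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("at least one labeled item is required")
    # Observed agreement: share of items on which the raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Expected agreement under independence, from each rater's label frequencies.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance. For more than two raters or multi-attribute labels, a different statistic would be needed.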
Circularity Check
No significant circularity; purely empirical study
This paper reports an empirical measurement study based on surveys (N=250) across platforms, testing conditions like platform choice, survey length, and mitigation measures. No derivations, equations, fitted parameters, or first-principles results are present that could reduce outputs to inputs by construction. Claims about LLM-response characteristics and prevalence differences rest on direct experimental observations and keystroke analysis rather than self-definitional loops, fitted-input predictions, or load-bearing self-citations. The work is self-contained as an experimental report with no mathematical chain to inspect for circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Observable text and interaction features can distinguish LLM-generated from human responses under the tested conditions.