Findings of the Counter Turing Test: AI-Generated Text Detection
Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3
The pith
Systems distinguish human text from AI-generated text with high reliability but perform worse when asked to identify the exact model that produced it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that current detection approaches achieve strong results on binary classification of human-written versus AI-generated text through fine-tuned transformer models, ensemble learning, and hybrid methods, while performance drops on the more demanding task of attributing text to particular language models, thereby indicating that distinguishing outputs across different generators requires additional advances in robustness and feature analysis.
What carries the argument
The Counter Turing Test shared tasks that separately measure binary human-AI classification and multi-class model attribution on fixed test sets of human and generated texts.
If this is right
- Fine-tuned transformer models combined in ensembles provide effective tools for basic separation of human and AI text.
- Model attribution exposes greater challenges in isolating distinctive patterns left by individual language models.
- Progress on detection will depend on improvements in adversarial robustness, better feature extraction, and stronger cross-domain performance.
Where Pith is reading between the lines
- Real-world platforms could integrate binary detectors to surface likely AI content for human review without needing to name the source model.
- The performance difference between the two tasks suggests that model-specific signatures exist and could become the focus of next-generation detectors.
- Applying the winning systems to text from models released after the shared task would test whether the reported results hold under genuine distribution shift.
Load-bearing premise
The shared task test sets give a representative sample of real-world text without meaningful distribution shifts or overlap with the data used to train the submitted detection systems.
What would settle it
A new collection of human-written and AI-generated texts drawn from domains or models absent from the original test sets, run through the top submitted systems to check whether binary classification accuracy remains as high as before.
Figures
read the original abstract
The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the outcomes of the Counter Turing Test (CT2) shared tasks for detecting AI-generated text. In Task A (binary classification of human vs. AI text), the top participating system achieved an F1 score of 1.0000. In Task B (model attribution to specific LLMs), the best system scored 0.9531. The paper highlights the use of fine-tuned transformer models such as DeBERTa and BART, ensembles, and hybrid approaches by top teams, while noting the greater difficulty of the attribution task.
Significance. Should the test sets prove to be uncontaminated and representative of real-world distributions, the findings would demonstrate that binary AI-text detection has reached high reliability under the shared-task conditions, whereas distinguishing among generative models remains more challenging. This would provide a useful benchmark for the field and motivate further work on robustness and generalization. The shared-task format itself offers value by enabling direct comparison of methods.
major comments (3)
- The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.
- No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.
- The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript reporting the findings of the Counter Turing Test shared tasks. We have revised the paper to address concerns about dataset transparency and statistical analysis, as detailed in the point-by-point responses below.
read point-by-point responses
-
Referee: The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.
Authors: We agree that the abstract would benefit from additional context. The revised abstract now qualifies the reported scores by noting that they were obtained on a test set constructed to reduce leakage risks, with full details on dataset creation provided in the main text. revision: yes
-
Referee: No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.
Authors: We thank the referee for highlighting this gap. We have added a dedicated subsection summarizing the data generation pipeline, including the models (GPT-4, Claude 3.5, Llama-3), prompt strategies, temperature settings of 0.7, and verification procedures such as post-cutoff generation dates and n-gram overlap checks against public corpora to confirm the test data was unseen. revision: yes
-
Referee: The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.
Authors: We accept this recommendation. The revised manuscript now includes bootstrap-derived 95% confidence intervals for the F1 scores, McNemar's tests confirming the performance difference between tasks is statistically significant, and a new error analysis section examining failure modes, particularly in model attribution. revision: yes
Circularity Check
No circularity: empirical shared-task results on fixed test data
full rationale
The paper is a findings report summarizing participant submissions to a shared task on binary AI-text detection and model attribution. Performance numbers (top F1 1.0000 for Task A, 0.9531 for Task B) are direct evaluation outcomes on held-out test sets submitted by independent teams; no equations, fitted parameters, or first-principles derivations appear that could reduce to the paper's own inputs by construction. The central claims rest on external team results rather than self-referential fitting or self-citation chains. This is a standard competition summary whose content is self-contained against the reported test data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Test data in the shared tasks follows the same distribution as real-world human and AI text without leakage or selection bias.
Reference graph
Works this paper leans on
-
[1]
Gpt-4 technical report.OpenAI Technical Report, 2023
OpenAI. Gpt-4 technical report.OpenAI Technical Report, 2023. URL https://arxiv.org/abs/2303. 08774
work page 2023
-
[2]
Claude ai: Conversational ai assistant
Anthropic. Claude ai: Conversational ai assistant. https://www.anthropic.com/claude, 2024. Accessed: 2025-01-25
work page 2024
-
[3]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. URL https://arxiv.org/abs/2302.13971
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Release strategies and the social impacts of language models
Irene Solaiman and Miles Brundage. Release strategies and the social impacts of language models. OpenAI Technical Report, 2019
work page 2019
-
[5]
Deception in ai-generated text: Adversarial evaluation.ACL Workshop on Fact-Checking, 2023
Prakhar Krishna et al. Deception in ai-generated text: Adversarial evaluation.ACL Workshop on Fact-Checking, 2023
work page 2023
-
[6]
Overview of text counter turing test: Ai generated text detection
Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, and Aman Chadha. Overview of text counter turing test: Ai generated tex...
work page 2025
-
[7]
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature, 2023. URL https://arxiv.org/abs/2301.11305
-
[8]
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. Gltr: Statistical detection and visualization of generated text, 2019. URL https://arxiv.org/abs/1906.04043
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[9]
Dnagpt: A generalized pre-trained tool for versatile dna sequence analysis tasks, 2023
Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, and Jianhua Yao. Dnagpt: A generalized pre-trained tool for versatile dna sequence analysis tasks, 2023. URL https://arxiv.org/abs/2307.05628
-
[10]
arXiv preprint arXiv:2401.12070 , url=
Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting llms with binoculars: Zero-shot detection of machine-generated text, 2024. URL https://arxiv.org/abs/2401.12070
-
[11]
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020. URL https://arxiv.org/abs/1905.12616
-
[12]
Automatic detection of generated text is easiest when humans are fooled, 2020
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of generated text is easiest when humans are fooled, 2020. URL https://arxiv.org/abs/1911.00650
-
[13]
Ghostbuster: Detecting text ghostwritten by large language models
Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwrit- ten by large language models, 2024. URL https://arxiv.org/abs/2305.15047
-
[14]
Authorship attribution for neural text generation
Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. Authorship attribution for neural text generation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384–8395, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/...
-
[15]
Raidar: generative ai detection via rewriting, 2024
Chengzhi Mao, Carl Vondrick, Hao Wang, and Junfeng Yang. Raidar: generative ai detection via rewriting, 2024. URL https://arxiv.org/abs/2401.12970
-
[16]
Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense, 2023
Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense, 2023. URL https: //arxiv.org/abs/2303.13408
-
[17]
A watermark for large language models.arXiv preprint arXiv:2301.10226, 2023a
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models, 2024. URL https://arxiv.org/abs/2301.10226
-
[18]
Modeling the attack: Detecting ai-generated text by quantifying adversarial perturbations,
Lekkala Sai Teja, Annepaka Yadagiri, Sangam Sai Anish, Siva Gopala Krishna Nuthakki, and Partha Pakray. Modeling the attack: Detecting ai-generated text by quantifying adversarial perturbations,
- [19]
-
[20]
A comprehensive dataset for human vs
Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, and Amitava Das. A comprehensive dataset for human vs. ai ...
-
[21]
Avinash Trivedi and Sangeetha Sivanesan. Sarang at defactify 4.0: Detecting ai-generated text using noised data and an ensemble of deberta models.arXiv preprint arXiv:2502.16857, 2025
-
[22]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention, 2021. URL https://arxiv.org/abs/2006.03654
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[23]
Scalable framework for classifying ai-generated content across modalities, 2025
Anh-Kiet Duong and Petra Gomez-Krämer. Scalable framework for classifying ai-generated content across modalities, 2025. URL https://arxiv.org/abs/2502.00375
-
[24]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019. URL https://arxiv.org/abs/ 1910.13461
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[25]
XGBoost: A Scalable Tree Boosting System
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 785–794. ACM, August 2016. doi: 10.1145/2939672.2939785. URL http://dx.doi.org/10.1145/ 2939672.2939785
-
[26]
Shrikant Malviya, Pablo Arnau-González, Miguel Arevalillo-Herráez, and Stamos Katsigiannis. Skdu at de-factify 4.0: Natural language features for ai-generated text-detection.arXiv preprint arXiv:2503.22338, 2025
-
[27]
Harika Abburi, Sanmitra Bhattacharya, Edward Bowen, and Nirmala Pudota. Ai-generated text detection: A multifaceted approach to binary and multiclass classification.arXiv preprint arXiv:2505.11550, 2025
-
[28]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training, 2024. URL https://arxiv.org/abs/2212.03533
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, and Edward Bowen. Ai generated text detection using instruction fine-tuned large language and transformer-based models.arXiv preprint arXiv:2507.05157, 2025
-
[30]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[32]
Shifali Agrahari and Sanasam Ranbir Singh. Tracing thought: Using chain-of-thought reasoning to identify the llm behind ai-generated text.arXiv preprint arXiv:2504.16913, 2025
-
[33]
Chain-of-thought prompting elicits reasoning in large language models,
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models,
-
[34]
URL https://arxiv.org/abs/2201.11903
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.