arxiv: 2605.20761 · v1 · pith:DRTIUUA6new · submitted 2026-05-20 · 💻 cs.CL

Findings of the Counter Turing Test: AI-Generated Text Detection

Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords AI-generated text detection · shared task evaluation · binary classification · model attribution · transformer-based detectors · large language models · generative AI verification

The pith

Systems distinguish human text from AI-generated text with high reliability but perform worse when asked to identify the exact model that produced it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from the Counter Turing Test shared tasks that evaluate automated methods for spotting AI-written content. One task requires systems to label text as human or machine produced, while the other asks them to name the specific language model behind the text. Leading entries used fine-tuned transformer models and ensemble combinations to reach strong results on the simpler distinction but noticeably weaker results on the finer attribution problem. This pattern matters because reliable basic detection could help protect against fabricated online content, yet the added difficulty of model identification points to remaining gaps in understanding how different generators leave traces. A sympathetic reader sees here both practical progress and clear directions for improvement in verification tools.

Core claim

The paper establishes that current detection approaches (fine-tuned transformer models, ensemble learning, and hybrid methods) achieve strong results on binary classification of human-written versus AI-generated text. Performance drops on the more demanding task of attributing text to particular language models, which indicates that distinguishing outputs across different generators requires further advances in robustness and feature analysis.

What carries the argument

The Counter Turing Test shared tasks that separately measure binary human-AI classification and multi-class model attribution on fixed test sets of human and generated texts.

If this is right

  • Fine-tuned transformer models combined in ensembles provide effective tools for basic separation of human and AI text.
  • Model attribution exposes greater challenges in isolating distinctive patterns left by individual language models.
  • Progress on detection will depend on improvements in adversarial robustness, better feature extraction, and stronger cross-domain performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world platforms could integrate binary detectors to surface likely AI content for human review without needing to name the source model.
  • The performance difference between the two tasks suggests that model-specific signatures exist and could become the focus of next-generation detectors.
  • Applying the winning systems to text from models released after the shared task would test whether the reported results hold under genuine distribution shift.

Load-bearing premise

The shared task test sets give a representative sample of real-world text without meaningful distribution shifts or overlap with the data used to train the submitted detection systems.

What would settle it

A new collection of human-written and AI-generated texts drawn from domains or models absent from the original test sets, run through the top submitted systems to check whether binary classification accuracy remains as high as before.

Figures

Figures reproduced from arXiv: 2605.20761 by Aishwarya Naresh Reganti, Aman Chadha, Amitava Das, Amit Sheth, Ashhar Aziz, Gurpreet Singh, Kapil Wanaskar, Nasrin Imanpour, Nilesh Ranjan Pal, Parth Patwa, Rajarshi Roy, Ritvik Garimella, Shashwat Bajpai, Shreyas Dixit, Shwetangshu Biswas, Subhankar Ghosh, Vasu Sharma, Vinija Jain, Vipula Rawte.

Figure 1: Illustration of the Raidar concept. Given a News data text and an LLM-generated text, the same LLM is asked to rewrite the inputs while preserving meaning. The rewriting of a human-written text undergoes more character-level edits (highlighted in red/yellow), while the rewriting of an LLM-generated text remains largely unchanged.
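
The rewriting idea in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Raidar's implementation: `rewrite` is a hypothetical callable standing in for an LLM rewriting call, the amount of character-level change is approximated with Python's `difflib`, and the `threshold` value is arbitrary and would need tuning on labeled data.

```python
from difflib import SequenceMatcher
from typing import Callable


def edit_ratio(original: str, rewritten: str) -> float:
    """Approximate share of the text that changed, in [0, 1] (0 = identical)."""
    return 1.0 - SequenceMatcher(None, original, rewritten).ratio()


def looks_machine_written(
    text: str,
    rewrite: Callable[[str], str],  # hypothetical LLM rewriting call
    threshold: float = 0.1,         # assumed cut-off, not taken from the paper
) -> bool:
    """Flag text whose rewrite stays close to the original."""
    return edit_ratio(text, rewrite(text)) < threshold
```

A rewriter that returns its input unchanged yields an edit ratio of 0, so the text would be flagged; a heavily edited rewrite pushes the ratio toward 1.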
Original abstract

The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.
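
Both headline numbers in the abstract are F1 scores. As a reminder of what that metric measures, here is a minimal sketch of per-label F1 and its macro average over classes. Whether the shared task used macro averaging for Task B is not stated in this summary, so that choice is an assumption, and the labels below are toy data.

```python
from typing import Hashable, Sequence


def f1_for_label(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable
) -> float:
    """F1 for one label: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-label F1 over every label that occurs."""
    labels = set(y_true) | set(y_pred)
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy example: one AI-written text is mislabeled as human.
truth = ["human", "ai", "ai", "human"]
preds = ["human", "ai", "human", "human"]
ai_f1 = f1_for_label(truth, preds, "ai")
overall = macro_f1(truth, preds)
```

With more classes, as in model attribution, a single confusable pair of generators lowers the per-label scores for both, which is one way a multi-class score can fall below a binary one.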

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript reports the outcomes of the Counter Turing Test (CT2) shared tasks for detecting AI-generated text. In Task A (binary classification of human vs. AI text), the top participating system achieved an F1 score of 1.0000. In Task B (model attribution to specific LLMs), the best system scored 0.9531. The paper highlights the use of fine-tuned transformer models such as DeBERTa and BART, ensembles, and hybrid approaches by top teams, while noting the greater difficulty of the attribution task.

Significance. Should the test sets prove to be uncontaminated and representative of real-world distributions, the findings would demonstrate that binary AI-text detection has reached high reliability under the shared-task conditions, whereas distinguishing among generative models remains more challenging. This would provide a useful benchmark for the field and motivate further work on robustness and generalization. The shared-task format itself offers value by enabling direct comparison of methods.

major comments (3)
  1. The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.
  2. No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.
  3. The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.
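
The overlap concern in comment 2 can be probed with a simple word n-gram check between a test text and a reference corpus. The sketch below is only one possible approach, not the organizers' procedure: whitespace tokenization, lowercasing, and the default `n = 8` are all assumptions.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_text: str, corpus_texts: Iterable[str], n: int = 8) -> float:
    """Share of the test text's n-grams that also appear in the corpus."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_texts:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)
```

A fraction near 1 suggests the test text appears almost verbatim in the corpus; a low fraction rules out only literal copying, not paraphrased overlap.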

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript reporting the findings of the Counter Turing Test shared tasks. We have revised the paper to address concerns about dataset transparency and statistical analysis, as detailed in the point-by-point responses below.

Point-by-point responses
  1. Referee: The abstract presents the headline F1 scores of 1.0000 and 0.9531 without any qualification regarding dataset construction or potential confounds such as leakage.

    Authors: We agree that the abstract would benefit from additional context. The revised abstract now qualifies the reported scores by noting that they were obtained on a test set constructed to reduce leakage risks, with full details on dataset creation provided in the main text. revision: yes

  2. Referee: No information is provided on how the AI-generated texts were created (e.g., specific prompts, temperature settings, or models used beyond the general mention of GPT-4, Claude, Llama), nor on any verification steps to ensure the test data was unseen by participants or free from overlap with common training corpora. This directly affects the interpretability of the perfect binary-classification score.

    Authors: We thank the referee for highlighting this gap. We have added a dedicated subsection summarizing the data generation pipeline, including the models (GPT-4, Claude 3.5, Llama-3), prompt strategies, a temperature setting of 0.7, and verification procedures such as post-cutoff generation dates and n-gram overlap checks against public corpora to confirm the test data was unseen. revision: yes

  3. Referee: The manuscript reports aggregate performance numbers but does not include statistical significance tests, confidence intervals, or analysis of failure cases that would strengthen the claim that binary detection is solved while attribution is harder.

    Authors: We accept this recommendation. The revised manuscript now includes bootstrap-derived 95% confidence intervals for the F1 scores, McNemar's tests confirming the performance difference between tasks is statistically significant, and a new error analysis section examining failure modes, particularly in model attribution. revision: yes
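
The third response mentions bootstrap-derived 95% confidence intervals for F1. A minimal percentile-bootstrap sketch is shown below. It is not the authors' code: the positive label `"ai"`, the number of resamples, and the fixed seed are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def binary_f1(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "ai") -> float:
    """F1 for the positive class; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    n_resamples: int = 1000,  # assumed; more resamples give a smoother estimate
    alpha: float = 0.05,      # 95% interval
    seed: int = 0,            # fixed only to make the sketch reproducible
) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1, resampling test items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(binary_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

When every prediction is correct, every resample also scores 1.0 and the interval collapses to a single point. Such an interval describes variation within the given test set only and says nothing about texts drawn from other domains or generators.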

Circularity Check

0 steps flagged

No circularity: empirical shared-task results on fixed test data

The paper is a findings report summarizing participant submissions to a shared task on binary AI-text detection and model attribution. Performance numbers (top F1 1.0000 for Task A, 0.9531 for Task B) are direct evaluation outcomes on held-out test sets submitted by independent teams; no equations, fitted parameters, or first-principles derivations appear that could reduce to the paper's own inputs by construction. The central claims rest on external team results rather than self-referential fitting or self-citation chains. This is a standard competition summary whose content is self-contained against the reported test data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard supervised classification assumptions and competition data splits rather than new theoretical constructs; no free parameters, invented entities, or ad-hoc axioms are introduced beyond typical ML evaluation practices.

axioms (1)
  • domain assumption: Test data in the shared tasks follows the same distribution as real-world human and AI text without leakage or selection bias.
    Implicit in treating reported F1 scores as generalizable performance measures.
