A Comprehensive Dataset for Human vs. AI Generated Text Detection

Aishwarya Naresh Reganti; Aman Chadha; Amitava Das; Amit Sheth; Ashhar Aziz; Gaytri Jena; Gurpreet Singh; Kapil Wanaskar; Nasrin Imanpour; Nilesh Ranjan Pal

arxiv: 2510.22874 · v3 · pith:3KFQ6ZWYnew · submitted 2025-10-26 · 💻 cs.CL


Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha


classification 💻 cs.CL

keywords dataset, text, models, ai-generated, accuracy, attributing, comprehensive, content


The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 73,193 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs, including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides the original article abstracts, used as prompts, and the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, with an accuracy of 58.35%, and attributing AI texts to their generating models, with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset
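The two tasks in the abstract are both scored by accuracy, so a minimal sketch of that metric may help when reading the reported numbers. The label strings, the toy predictions, and the helper names `accuracy` and `uniform_chance` below are illustrative assumptions, not the paper's evaluation code. The sketch also assumes that attribution is a balanced six-way problem over the generators named in the abstract, which the abstract does not state.

```python
from typing import Hashable, Sequence

# Generator names as listed in the abstract. Using them as class labels
# is an assumption made for this illustration.
GENERATORS = [
    "Gemma-2-9b", "Mistral-7B", "Qwen-2-72B",
    "LLaMA-8B", "Yi-Large", "GPT-4-o",
]


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def uniform_chance(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing on balanced classes."""
    if n_classes < 1:
        raise ValueError("n_classes must be positive")
    return 1.0 / n_classes


# Task A (hypothetical toy data): human-written vs. AI-generated.
binary_gold = ["human", "ai", "ai", "human"]
binary_pred = ["human", "ai", "human", "human"]
binary_acc = accuracy(binary_gold, binary_pred)

# Task B (hypothetical toy data): which generator produced an AI text.
attr_gold = ["Mistral-7B", "Yi-Large", "GPT-4-o"]
attr_pred = ["Mistral-7B", "GPT-4-o", "GPT-4-o"]
attr_acc = accuracy(attr_gold, attr_pred)

# Reference points under the balanced-classes assumption.
binary_chance = uniform_chance(2)
attr_chance = uniform_chance(len(GENERATORS))
```

Under these assumptions, uniform guessing gives 50% on the binary task and about 16.7% on six-way attribution, which offers a rough reference point for the 58.35% and 8.92% baselines.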


Forward citations

Cited by 1 Pith paper


Findings of the Counter Turing Test: AI-Generated Text Detection
cs.CL 2026-05 unverdicted novelty 2.0

Shared task findings show F1=1.0000 for binary AI text detection and 0.9531 for model attribution using fine-tuned DeBERTa and BART transformers with ensembles.