A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
hub
Can AI-Generated Text be Reliably Detected?
25 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large Language Models (LLMs) perform impressively well in various applications. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. AI text detectors have shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce recursive paraphrasing attack to stress test a wide range of detection schemes, including the ones using the watermarking as well as neural network-based detectors, zero shot classifiers, and retrieval-based detectors. Our experiments conducted on passages, each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.
hub tools
citation-role summary
citation-polarity summary
roles
background 4representative citing papers
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.
Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.
SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
LLM agents make collective belief dynamics programmable, with simulations showing coordinated agents induce stable belief shifts, and four structural properties that complicate detection and defense.
First high-rate public-key binary PRCs for edit channels via reduction from Hamming-robust PRCs and alphabet-size constructions attaining near-Singleton rates.
LLM-generated political discourse across crises is fluent yet caricatured: more negative, less emotionally varied, more structurally regular, and lexically abstract than observed online populations.
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.
AI-generated text detectors achieve high benchmark accuracy by exploiting unstable dataset-specific linguistic features, as evidenced by cross-domain degradation and differing SHAP explanations across corpora.
RedNote-Vibe supplies a longitudinal dataset of AI versus human lifestyle posts from 2020 to mid-2025 plus the PLAD detection framework that applies cognitive psychology signatures for improved AI-text identification.
Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.
A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.
Chained rewrites by open-weight LLMs reduce watermark detection on diffusion LM outputs from 87.9% to 4.86% after five steps across multiple styles and models.
Industry practitioners indicate that generative AI heightens demand for prompting and output evaluation skills while reinforcing the value of problem-solving, critical thinking, architecture design, and debugging in software engineering.
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
AI-saturated markets will produce premiums for verified human presence in labor, requiring governance to treat human-provenance verification as infrastructure rather than optional authenticity labels.
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
citing papers explorer
-
Who Owns This Agent? Tracing AI Agents Back to Their Owners
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
-
Base Models Look Human To AI Detectors
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
-
PeerPrism: Peer Evaluation Expertise vs Review-writing AI
PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.
-
The End of Trust: How Agentic AI Breaks Security Assumptions
Agentic AI eliminates the fidelity-scale tradeoff in deception, enabling the Infinite Impostor attack that hijacks trusted relationships at mass scale and requiring a shift to suspect-by-default security based on evaluating actions rather than actors.
-
Detecting Verbatim LLM Copy-Paste in Homework
SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
-
Can AI-Generated Text be Reliably Detected?
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
-
LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions
LLM agents make collective belief dynamics programmable, with simulations showing coordinated agents induce stable belief shifts, and four structural properties that complicate detection and defense.
-
High-Rate Public-Key Pseudorandom Codes for Edit Errors
First high-rate public-key binary PRCs for edit channels via reduction from Hamming-robust PRCs and alphabet-size constructions attaining near-Singleton rates.
-
The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
LLM-generated political discourse across crises is fluent yet caricatured: more negative, less emotionally varied, more structurally regular, and lexically abstract than observed online populations.
-
Process Matters more than Output for Distinguishing Humans from Machines
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
-
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
-
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
-
Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models
Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.
-
Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
AI-generated text detectors achieve high benchmark accuracy by exploiting unstable dataset-specific linguistic features, as evidenced by cross-domain degradation and differing SHAP explanations across corpora.
-
RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Lifestyle Social Media
RedNote-Vibe supplies a longitudinal dataset of AI versus human lifestyle posts from 2020 to mid-2025 plus the PLAD detection framework that applies cognitive psychology signatures for improved AI-text identification.
-
Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement
Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.
-
Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.
-
Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Chained rewrites by open-weight LLMs reduce watermark detection on diffusion LM outputs from 87.9% to 4.86% after five steps across multiple styles and models.
-
"Don't Be Afraid, Just Learn": Insights from Industry Practitioners to Prepare Software Engineers in the Age of Generative AI
Industry practitioners indicate that generative AI heightens demand for prompting and output evaluation skills while reinforcing the value of problem-solving, critical thinking, architecture design, and debugging in software engineering.
-
Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Human-Provenance Verification should be Treated as Labor Infrastructure in AI-Saturated Markets
AI-saturated markets will produce premiums for verified human presence in labor, requiring governance to treat human-provenance verification as infrastructure rather than optional authenticity labels.
-
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
- Privacy-Preserving Proof of Human Authorship via Zero-Knowledge Process Attestation
- Detecting Cognitive Signatures in Typing Behavior for Non-Intrusive Authorship Verification