
arxiv: 2603.23219 · v1 · submitted 2026-03-24 · 💻 cs.CL · cs.LG

Decoding AI Authorship: Can LLMs Truly Mimic Human Style Across Literature and Politics?

Pith reviewed 2026-05-15 00:39 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords AI-generated text · stylometric analysis · authorship attribution · perplexity · zero-shot prompting · LLM detection · XGBoost · digital humanities

The pith

AI text mimicking human authors remains highly detectable even with simple models on just eight features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether top large language models can copy the distinct writing styles of figures such as Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. It creates AI samples through zero-shot prompts that match the same topics as the originals, then measures how well those samples blend in using both neural classifiers and simpler interpretable models. The central result is that current AI outputs stay distinguishable, mainly because they show lower variability in word predictability than human text does. Perplexity stands out as the strongest single signal of this gap, while the models do match humans on basic measures like sentence complexity. The work supplies a concrete benchmark for where stylistic mimicry still falls short and what that means for spotting AI content in literature or public discourse.

Core claim

State-of-the-art LLMs including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5 produce text that diverges from the authorial signatures of Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. When evaluated with zero-shot prompting under strict thematic alignment, the synthetic corpora are readily separated from human text by an XGBoost model trained on only eight stylometric features, reaching accuracy levels comparable to high-dimensional BERT classifiers. Feature importance analysis singles out perplexity as the leading discriminator, reflecting reduced stochastic variability in AI outputs relative to the greater variance found in human writing. Although LLMs converge with the

What carries the argument

XGBoost classifier on a restricted set of eight stylometric features (LIWC markers, perplexity, readability indices) where perplexity ranks as the top discriminator of AI versus human text.
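
As a rough illustration of the perplexity feature named above, the sketch below computes perplexity from per-token log-probabilities and summarizes how much it varies across a set of texts. It assumes the standard definition (the exponential of the mean negative log-probability) and that some scoring language model supplies the log-probabilities; the paper's actual scoring model and pipeline are not described here, and the function names `perplexity` and `perplexity_spread` are hypothetical.

```python
import math
from statistics import mean, pstdev


def perplexity(token_log_probs):
    """Perplexity of one text from its per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-probability.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_spread(texts_log_probs):
    """Mean and population standard deviation of perplexity across texts."""
    scores = [perplexity(lp) for lp in texts_log_probs]
    return mean(scores), pstdev(scores)


# Hypothetical log-probabilities, invented purely for illustration.
flat_corpus = [[math.log(0.25)] * 6, [math.log(0.25)] * 9]
varied_corpus = [[math.log(0.5)] * 6, [math.log(0.05)] * 9]

print(perplexity_spread(flat_corpus))    # identical texts: spread near zero
print(perplexity_spread(varied_corpus))  # dissimilar texts: larger spread
```

A low spread across a corpus corresponds to the "reduced stochastic variability" the summary attributes to AI outputs; the numbers here are made up and say nothing about the paper's data.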

If this is right

  • AI outputs converge with human text on low-dimensional features such as syntactic complexity and readability but diverge on affective density and stylistic variance.
  • Perplexity alone functions as a primary, interpretable metric for authorship attribution tasks in digital humanities and political text.
  • Simple tree-based models can match the performance of transformer classifiers when the goal is detecting generative mimicry.
  • Current LLMs do not yet reproduce the higher stochastic irregularity typical of human-authored corpora.
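
The second bullet above, perplexity as a standalone interpretable signal, can be sketched as a one-feature threshold classifier (a decision stump). This is not the paper's XGBoost model: it is a deliberately minimal stand-in, the function name `fit_stump` is hypothetical, and the perplexity values are invented. The stump tries both directions, since the summary does not say whether AI text sits above or below human text on this feature.

```python
def fit_stump(perplexities, labels):
    """Find the perplexity cut-off that best separates AI (1) from human (0).

    Returns (cut, ai_is_below, accuracy). Both directions are tried, so no
    assumption is made about which class has the lower perplexity.
    """
    values = sorted(set(perplexities))
    if len(values) < 2:
        raise ValueError("need at least two distinct perplexity values")
    best = (None, True, -1.0)
    # Candidate cut-offs are midpoints between neighbouring observed values.
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        for ai_is_below in (True, False):
            preds = [int((p < cut) == ai_is_below) for p in perplexities]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[2]:
                best = (cut, ai_is_below, acc)
    return best


# Hypothetical perplexity scores, invented purely for illustration.
ai_ppl = [12.0, 14.5, 13.2, 15.1]
human_ppl = [22.4, 30.1, 18.9, 27.5]
cut, ai_is_below, acc = fit_stump(ai_ppl + human_ppl, [1] * 4 + [0] * 4)
print(cut, ai_is_below, acc)
```

A single cut-off like this is easy to inspect, which is the appeal of an interpretable metric; whether it holds up on real corpora is exactly what the paper's reported accuracies would need to show.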

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection pipelines built on these eight features could be adapted for real-time screening of AI-generated social media or news content.
  • Future model training that explicitly encourages higher output variability might reduce the current gap in stylistic variance.
  • Testing the same feature set on few-shot prompting or on additional authors could show whether detectability changes with more context.

Load-bearing premise

Zero-shot prompting with strict thematic alignment is sufficient to expose the true limits of current LLM stylistic mimicry without any author-specific training examples.

What would settle it

Fine-tune the tested LLMs on author-specific samples, regenerate text under the same prompts, and check whether detection accuracy with the eight-feature XGBoost model falls well below the levels reported for zero-shot outputs.
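
One way to make "falls well below" checkable is a one-sided two-proportion z-test on detection accuracy under the two conditions. The sketch below is an assumption about how such a comparison could be run, not something the paper reports; the function name `two_proportion_z` and all counts are hypothetical.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """One-sided z-test for H1: accuracy in condition A exceeds condition B.

    Returns (z, p_value) using the pooled-proportion standard error.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: texts correctly flagged by the detector under
# zero-shot generation (A) and after author-specific fine-tuning (B).
z, p = two_proportion_z(correct_a=190, n_a=200, correct_b=150, n_b=200)
print(f"z = {z:.2f}, one-sided p = {p:.2g}")
```

A small p-value would only show that the drop is unlikely to be noise; whether the drop is large enough to matter in practice is a separate judgment.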

Original abstract

Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus. By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that state-of-the-art LLMs (GPT-4o, Gemini 1.5 Pro, Claude Sonnet 3.5) cannot fully replicate the authorial styles of Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Using zero-shot prompting with thematic alignment to generate synthetic text, it evaluates divergence via a complementary pipeline of BERT-based classification and XGBoost on eight stylometric features (LIWC markers, perplexity, readability indices). Results show high detectability, with XGBoost on the restricted feature set matching neural classifier accuracy and perplexity emerging as the dominant signal; LLMs converge on low-dimensional heuristics but diverge in affective density and stylistic variance.

Significance. If the central detectability result holds, the work supplies a practical benchmark for LLM stylistic behavior and actionable insights for authorship attribution in digital humanities and social media. Credit is due for the complementary neural-plus-interpretable classifier design and the explicit feature-importance analysis that isolates perplexity as the primary discriminator.

major comments (1)
  1. [§4] §4 (Methodology): The claim that LLMs 'do not yet fully replicate the nuanced affective density and stylistic variance' rests on zero-shot prompting alone. No ablation against few-shot examples, style-specific instructions, or iterative refinement is reported; stronger elicitation could narrow the perplexity and variance gaps, rendering the observed limits protocol-dependent rather than a general property of current models.
minor comments (2)
  1. [§5] §5 (Results): Sample sizes per author, total corpus sizes, exact prompt templates, and generation hyperparameters are not reported, and no error bars or statistical significance tests accompany the accuracy and feature-importance figures.
  2. [§5] §5 (Results): The statement that XGBoost on eight features achieves 'accuracy comparable' to high-dimensional neural classifiers would be strengthened by an explicit side-by-side table of precision, recall, and F1 scores rather than a qualitative claim.
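
The two minor comments ask for per-class metrics and for uncertainty estimates. A minimal sketch of both, assuming binary labels (1 = AI, 0 = human) and a percentile bootstrap for the accuracy interval, could look like the following. The function names are hypothetical and the labels are invented; none of these numbers come from the paper.

```python
import random


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (1 = AI)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    # Resample per-example correctness with replacement and recompute accuracy.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical labels and predictions, invented purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
print(bootstrap_accuracy_ci(y_true, y_pred))
```

Running the same two functions on the BERT and XGBoost predictions for each author would yield the side-by-side table and error bars the referee requests.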

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important methodological consideration. We address the single major comment below and will revise the manuscript to incorporate the suggested clarification.

point-by-point responses
  1. Referee: [§4] §4 (Methodology): The claim that LLMs 'do not yet fully replicate the nuanced affective density and stylistic variance' rests on zero-shot prompting alone. No ablation against few-shot examples, style-specific instructions, or iterative refinement is reported; stronger elicitation could narrow the perplexity and variance gaps, rendering the observed limits protocol-dependent rather than a general property of current models.

    Authors: We appreciate this observation. Our deliberate use of zero-shot prompting was intended to evaluate the models' inherent stylistic mimicry capabilities without any in-context examples or additional guidance, providing a stringent baseline that reflects real-world deployment scenarios where explicit style exemplars are unavailable. This protocol choice aligns with common practices in LLM evaluation for authorship tasks. We agree, however, that the reported limits are tied to zero-shot conditions and that stronger elicitation methods could potentially reduce the observed gaps. In the revised manuscript, we will expand §4 to explicitly state this scope limitation, clarify that results pertain to zero-shot prompting, and add a forward-looking discussion recommending future ablations with few-shot examples and iterative refinement as valuable extensions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical classifier evaluation on independent stylometric features

full rationale

The paper reports an empirical study that trains XGBoost and BERT classifiers on held-out stylometric features (LIWC, perplexity, readability) extracted from zero-shot generated versus human text. No equations, derivations, or fitted parameters are presented that reduce the detectability claim to a quantity defined from the same data by construction. Feature importance is computed post-training on the evaluation set, and the central result (perplexity as top discriminator) follows directly from standard ML analysis without self-referential reduction or load-bearing self-citation chains. The methodology is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard assumptions about prompting and feature extraction rather than new postulates.

free parameters (1)
  • selection of eight stylometric features
    The restricted feature set for XGBoost is chosen from a larger possible pool; the exact selection process is not detailed in the abstract.
axioms (1)
  • domain assumption: Zero-shot prompting with thematic alignment elicits representative stylistic mimicry from the LLMs
    The framework assumes the prompts are sufficient to test the models' capacity without further adaptation.

pith-pipeline@v0.9.0 · 5563 in / 1191 out tokens · 24947 ms · 2026-05-15T00:39:26.372441+00:00 · methodology

