Decoding AI Authorship: Can LLMs Truly Mimic Human Style Across Literature and Politics?

Nasser A Alsadhan

arxiv: 2603.23219 · v1 · submitted 2026-03-24 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-15 00:39 UTC · model grok-4.3

keywords: AI-generated text · stylometric analysis · authorship attribution · perplexity · zero-shot prompting · LLM detection · XGBoost · digital humanities

The pith

AI text mimicking human authors remains highly detectable even with simple models on just eight features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether top large language models can copy the distinct writing styles of figures such as Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. It creates AI samples through zero-shot prompts that match the same topics as the originals, then measures how well those samples blend in using both neural classifiers and simpler interpretable models. The central result is that current AI outputs stay distinguishable, mainly because they show lower variability in word predictability than human text does. Perplexity stands out as the strongest single signal of this gap, while the models do match humans on basic measures like sentence complexity. The work supplies a concrete benchmark for where stylistic mimicry still falls short and what that means for spotting AI content in literature or public discourse.

Core claim

State-of-the-art LLMs including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5 produce text that diverges from the authorial signatures of Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. When evaluated with zero-shot prompting under strict thematic alignment, the synthetic corpora are readily separated from human text by an XGBoost model trained on only eight stylometric features, reaching accuracy levels comparable to high-dimensional BERT classifiers. Feature importance analysis singles out perplexity as the leading discriminator, reflecting reduced stochastic variability in AI outputs relative to the greater variance found in human writing. Although LLMs converge with the

What carries the argument

XGBoost classifier on a restricted set of eight stylometric features (LIWC markers, perplexity, readability indices) where perplexity ranks as the top discriminator of AI versus human text.
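
How one feature can come out as the top discriminator is sketched below with a deliberately simpler stand-in: one-split decision stumps scored by accuracy, not XGBoost and its importance scores. The function names, feature names, and toy values are hypothetical and are not the paper's data; the lower perplexity given to the AI rows is an assumption made only so the toy example separates cleanly.

```python
def best_stump_accuracy(values, labels):
    """Best accuracy of a single threshold split on one feature.

    labels are 0 (human) or 1 (AI); both split directions are tried.
    """
    n = len(values)
    best = 0.0
    for threshold in sorted(set(values)):
        # Predict AI (1) when the value is at or below the threshold.
        hits = sum((v <= threshold) == bool(y) for v, y in zip(values, labels))
        # (n - hits) / n is the accuracy of the opposite direction.
        best = max(best, hits / n, (n - hits) / n)
    return best


def rank_features(table, labels):
    """Sort feature names by stump accuracy, strongest first."""
    scores = {name: best_stump_accuracy(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy features: 4 human rows (label 0), then 4 AI rows (label 1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
table = {
    "perplexity": [42.0, 55.0, 38.0, 61.0, 18.0, 21.0, 17.0, 23.0],
    "readability": [60.0, 72.0, 65.0, 70.0, 66.0, 71.0, 63.0, 69.0],
}
ranking = rank_features(table, labels)
print(ranking)
```

A boosted ensemble combines many such splits across features, so a real importance ranking need not match this single-split ordering.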

If this is right

AI outputs converge with human text on low-dimensional features such as syntactic complexity and readability but diverge on affective density and stylistic variance.
Perplexity alone functions as a primary, interpretable metric for authorship attribution tasks in digital humanities and political text.
Simple tree-based models can match the performance of transformer classifiers when the goal is detecting generative mimicry.
Current LLMs do not yet reproduce the higher stochastic irregularity typical of human-authored corpora.
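
The perplexity signal in the points above can be made concrete with a minimal sketch. It assumes per-token probabilities are already available from some scoring language model (the abstract does not say which one), and the helper names `perplexity` and `perplexity_spread` are hypothetical. Perplexity is taken as the exponential of the mean negative log-probability; the spread is the standard deviation of per-sentence perplexities, which is only one plausible way to quantify "variability".

```python
import math
import statistics


def perplexity(token_probs):
    """Perplexity of one sequence: exp of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def perplexity_spread(sentences):
    """Population standard deviation of per-sentence perplexities."""
    return statistics.pstdev([perplexity(s) for s in sentences])


# Toy, made-up probabilities (not data from the paper).
uniform_text = [[0.25, 0.25, 0.25], [0.25, 0.25]]  # evenly predictable
uneven_text = [[0.5, 0.5], [0.1, 0.1, 0.1]]        # predictability varies

print(perplexity(uniform_text[0]))
print(perplexity_spread(uniform_text))
print(perplexity_spread(uneven_text))
```

In this toy setup the evenly predictable text has essentially zero spread while the uneven text has a large one, which is the kind of contrast the detectability claim relies on.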

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detection pipelines built on these eight features could be adapted for real-time screening of AI-generated social media or news content.
Future model training that explicitly encourages higher output variability might reduce the current gap in stylistic variance.
Testing the same feature set on few-shot prompting or on additional authors could show whether detectability changes with more context.

Load-bearing premise

Zero-shot prompting with strict thematic alignment is sufficient to expose the true limits of current LLM stylistic mimicry without any author-specific training examples.

What would settle it

Fine-tune the tested LLMs on author-specific samples, regenerate text under the same prompts, and check whether detection accuracy with the eight-feature XGBoost model falls well below the levels reported for zero-shot outputs.

Original abstract

Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus. By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.
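
The abstract lists "readability indices" without naming them. As an illustration only, the sketch below computes Flesch Reading Ease, one widely used index; whether the paper uses it is an assumption. The syllable counter is a crude vowel-group heuristic, and both function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels (heuristic, not a dictionary)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


score = flesch_reading_ease("The cat sat on the mat. It was warm.")
print(round(score, 2))
```

Higher scores mean easier text. Because the index depends only on sentence length and syllable counts, it is the kind of low-dimensional feature on which the abstract reports convergence between AI and human text.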

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows current LLMs stay detectable from human authors via perplexity even with simple features, but zero-shot prompting leaves open whether better elicitation would close the gap.

The core finding is that GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet still produce text distinguishable from Whitman, Wordsworth, Trump, and Obama when prompted zero-shot on matching themes. Perplexity emerges as the strongest signal, letting an XGBoost model on eight stylometric features match BERT accuracy while staying interpretable. The work pairs neural and tree-based classifiers, adds LIWC and readability checks, and reports that AI text shows lower variance than the human samples on affective and stylistic measures. This combination of recent models with both literary and political targets is the concrete extension here, and the feature-importance step is a useful addition for anyone building detection tools. The pipeline is coherent on its own terms and the perplexity result aligns with what we already know about model stochasticity.

The main limitation is the lack of reported sample sizes, exact prompt text, statistical tests, or error bars, which makes it hard to judge how stable the accuracy numbers really are. The zero-shot setup with thematic alignment is also a soft spot: stronger prompting with examples or iterative refinement might reduce the observed gaps, so the claim that LLMs cannot yet match human variance rests partly on an untested assumption about elicitation.

This is useful for people working on authorship attribution or content moderation who need a current benchmark. It is not a new method but it supplies a practical data point on where the models still diverge. I would send it to peer review so referees can check the experimental details and prompting choices.

Referee Report

1 major / 2 minor

Summary. The paper claims that state-of-the-art LLMs (GPT-4o, Gemini 1.5 Pro, Claude Sonnet 3.5) cannot fully replicate the authorial styles of Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Using zero-shot prompting with thematic alignment to generate synthetic text, it evaluates divergence via a complementary pipeline of BERT-based classification and XGBoost on eight stylometric features (LIWC markers, perplexity, readability indices). Results show high detectability, with XGBoost on the restricted feature set matching neural classifier accuracy and perplexity emerging as the dominant signal; LLMs converge on low-dimensional heuristics but diverge in affective density and stylistic variance.

Significance. If the central detectability result holds, the work supplies a practical benchmark for LLM stylistic behavior and actionable insights for authorship attribution in digital humanities and social media. Credit is due for the complementary neural-plus-interpretable classifier design and the explicit feature-importance analysis that isolates perplexity as the primary discriminator.

major comments (1)

[§4] §4 (Methodology): The claim that LLMs 'do not yet fully replicate the nuanced affective density and stylistic variance' rests on zero-shot prompting alone. No ablation against few-shot examples, style-specific instructions, or iterative refinement is reported; stronger elicitation could narrow the perplexity and variance gaps, rendering the observed limits protocol-dependent rather than a general property of current models.

minor comments (2)

[§5] §5 (Results): Sample sizes per author, total corpus sizes, exact prompt templates, and generation hyperparameters are not reported, and no error bars or statistical significance tests accompany the accuracy and feature-importance figures.
[§5] §5 (Results): The statement that XGBoost on eight features achieves 'accuracy comparable' to high-dimensional neural classifiers would be strengthened by an explicit side-by-side table of precision, recall, and F1 scores rather than a qualitative claim.
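
The reporting these two minor comments ask for can be sketched in a few lines. Below, `prf1` computes precision, recall, and F1 for the AI class, and `bootstrap_accuracy_ci` gives a percentile bootstrap interval for accuracy. The function names, the resample count, and the toy labels are assumptions for illustration; none of this is the paper's evaluation code or its results.

```python
import random


def prf1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class marked `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    # Resample the per-example correctness flags with replacement.
    accs = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy labels (1 = AI, 0 = human); made up, not results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = prf1(y_true, y_pred)
lo, hi = bootstrap_accuracy_ci(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Running both functions once per classifier on the same held-out split would give the side-by-side rows the second comment requests, with the interval serving as the missing error bar.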

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important methodological consideration. We address the single major comment below and will revise the manuscript to incorporate the suggested clarification.

Referee: [§4] §4 (Methodology): The claim that LLMs 'do not yet fully replicate the nuanced affective density and stylistic variance' rests on zero-shot prompting alone. No ablation against few-shot examples, style-specific instructions, or iterative refinement is reported; stronger elicitation could narrow the perplexity and variance gaps, rendering the observed limits protocol-dependent rather than a general property of current models.

Authors: We appreciate this observation. Our deliberate use of zero-shot prompting was intended to evaluate the models' inherent stylistic mimicry capabilities without any in-context examples or additional guidance, providing a stringent baseline that reflects real-world deployment scenarios where explicit style exemplars are unavailable. This protocol choice aligns with common practices in LLM evaluation for authorship tasks. We agree, however, that the reported limits are tied to zero-shot conditions and that stronger elicitation methods could potentially reduce the observed gaps. In the revised manuscript, we will expand §4 to explicitly state this scope limitation, clarify that results pertain to zero-shot prompting, and add a forward-looking discussion recommending future ablations with few-shot examples and iterative refinement as valuable extensions.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical classifier evaluation on independent stylometric features

The paper reports an empirical study that trains XGBoost and BERT classifiers on held-out stylometric features (LIWC, perplexity, readability) extracted from zero-shot generated versus human text. No equations, derivations, or fitted parameters are presented that reduce the detectability claim to a quantity defined from the same data by construction. Feature importance is computed post-training on the evaluation set, and the central result (perplexity as top discriminator) follows directly from standard ML analysis without self-referential reduction or load-bearing self-citation chains. The methodology is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard assumptions about prompting and feature extraction rather than new postulates.

free parameters (1)

selection of eight stylometric features
The restricted feature set for XGBoost is chosen from a larger possible pool; the exact selection process is not detailed in the abstract.

axioms (1)

domain assumption: Zero-shot prompting with thematic alignment elicits representative stylistic mimicry from the LLMs
The framework assumes the prompts are sufficient to test the models' capacity without further adaptation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.