pith. machine review for the scientific record.

arxiv: 2604.14111 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM-generated text · stylistic variation · Biber features · human vs machine writing · genre effects · decoding strategies · model comparison

The pith

Stylistic markers that separate LLM text from human writing stay stable even under prompts and decoding meant to make it sound human.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how stylistic features differ between human writing and text generated by large language models across eight genres. It applies a fixed set of linguistic measurements to outputs from eleven models and four decoding approaches. The results show that core linguistic signals of machine-generated text do not vanish when models are told to imitate humans or to continue human-written text. Genre turns out to shape stylistic patterns more than whether the source is human or machine. Model identity affects style more than the choice of decoding strategy, with chat-tuned versions tending to cluster together.

Core claim

Using Douglas Biber's lexicogrammatical and functional features, the analysis shows that the key differentiators between human and LLM text are robust to variation in generation conditions, such as prompts that request human-like output or human-written text supplied for continuation. Genre has a stronger effect on stylistic features than source does, chat variants of models cluster together stylistically, and model choice influences style more than decoding strategy, though with some exceptions.

What carries the argument

Douglas Biber's set of lexicogrammatical and functional features, which counts specific linguistic structures to place texts in a multidimensional stylistic space.
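A concrete (and entirely hypothetical) illustration of that machinery: count a few interpretable features per text, z-score them, and project the texts into a low-dimensional stylistic space via PCA. The three toy features below stand in for the paper's 67-feature Biber inventory; the function names and feature choices are illustrative, not the authors' code.

```python
import re
import numpy as np

def biber_style_features(text):
    """Three toy stand-ins for Biber-style counts (the paper uses 67 features)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    return np.array([
        len(set(tokens)) / n,                                        # type-token ratio
        sum(t in {"i", "we", "me", "us"} for t in tokens) / n,       # 1st-person pronouns
        sum(t in {"is", "was", "are", "were"} for t in tokens) / n,  # forms of "be"
    ])

def stylistic_space(texts, k=2):
    """Z-score the text-by-feature matrix and project onto the top-k PCA axes."""
    X = np.stack([biber_style_features(t) for t in texts])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # columns now mean-0
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt = principal axes
    return X @ Vt[:k].T                                # per-text k-D coordinates

coords = stylistic_space([
    "We were pleased to report the results.",
    "I think it is fine, and it is really fine.",
    "The model was trained and the data were cleaned.",
])
```

In the paper, coordinates of this kind feed the PCA scatterplots and dendrograms; here they are only a shape-correct sketch.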

If this is right

  • Stylistic detection methods relying on these features may continue to work even when users optimize prompts for naturalness.
  • Genre-specific adaptation of models could prove more effective for controlling output style than general human-imitation prompts.
  • Chat variants of models produce stylistically similar outputs to one another even when base models differ.
  • Changes in decoding parameters produce smaller stylistic shifts than switching between different models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection tools built on these features might generalize more reliably to everyday use cases than approaches sensitive to prompting tricks.
  • Extending the analysis to include semantic embeddings could test whether genre dominance holds in meaning-related dimensions of text.
  • Writers using LLMs for varied tasks may benefit more from selecting the right model than from adjusting generation settings.

Load-bearing premise

That Douglas Biber's lexicogrammatical and functional features adequately and without bias capture the stylistically relevant dimensions of variation between human and LLM text across the sampled genres and conditions.

What would settle it

If a different feature set focused on semantic or discourse patterns showed large shifts in human-LLM differences under human-mimicking prompts or continuation conditions, that would indicate the reported robustness is limited to this particular measurement approach.

Figures

Figures reproduced from arXiv: 2604.14111 by Chuck Loughin, Michele Sezgin, Ronald Yurko, Shannon Gallagher, Swati Rallapalli, Tyler Brooks, Violet Turri.

Figure 1. Top 15 most important Biber features for …
Figure 2. Feature Distribution for f_43_type_token (with 200 trees) using Biber features on the RAID data. The dataset is inherently imbalanced, as each human-written document is paired with outputs from 11 LLMs. Therefore, we train two models: (i) a base model using the full imbalanced data, and (ii) a down-sampled model in which the majority class (MGT) is reduced. We use a 60%/20%/20% train/validation/test split. …
Figure 4. Ratio of Feature Usage LLM vs. Human for …
Figure 5. Heat map of Log10 Biber Feature Ratios (LLM / Human) for Selected Important Features.
Figure 6. Scatterplot of first two PCA dimensions of mean Biber features.
Figure 7. Relaxed consensus tree summarizing hierarchical …
Figure 8. Visualizations of Biber features across selected RAID genres. Colors indicate model identity, while …
Figure 9. ROC curve comparing base model and down-sampled …
Figure 10. Distribution of Biber features (Part 1).
Figure 11. Distribution of Biber features (Part 2).
Figure 12. Ratio of Feature Usage LLM vs. Human for All Features.
Figure 13. Heat map of Log10 Biber Feature Ratios (LLM / Human) for All Features.
Figure 14. Hierarchical clustering dendrograms using PCA with components capturing 95% variance on normalized …
Figure 15. PCA visualizations of Biber features across eight RAID categories. Colors indicate model identity, while …
original abstract

Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber's set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a large-scale observational study projecting human-written and LLM-generated texts (11 models, 8 genres, 4 decoding strategies) onto Douglas Biber's 67 lexicogrammatical and functional features, then applies PCA, clustering, and variance partitioning. It concludes that key stylistic differentiators are robust to prompting/continuation conditions, genre exerts stronger influence than source (human vs. LLM), chat-model variants cluster together, and model identity has larger effect than decoding strategy.

Significance. If the Biber feature inventory adequately spans the relevant stylistic dimensions, the scale of the comparison across genres, models, and strategies supplies useful guidance on which factors most shape LLM output style. The explicit use of an established, interpretable feature set (rather than black-box embeddings) is a methodological strength that supports the paper's claim to interpretability.

major comments (2)
  1. [§3] §3 (Feature Extraction and Analysis Pipeline): All headline results (robustness to prompts/continuations, genre > source, chat clustering, model > decoding) are obtained by projecting texts onto Biber's 67 features calibrated on human register variation. No ablation, external validation against LLM-specific markers (repetition, burstiness, formulaic n-gram overuse), or comparison to alternative feature sets is reported. This choice is load-bearing for the central claims; if primary human-LLM differences lie outside the Biber coordinate system, the reported relative effect sizes and robustness conclusions become artifacts of the chosen representation rather than intrinsic properties of the texts.
  2. [Results] Results section (variance partitioning and clustering): The manuscript reports directional findings on relative influence of factors but supplies no sample sizes per cell, statistical tests, effect sizes, or controls for prompt length. Without these, it is impossible to evaluate whether the claimed dominance of genre over source, or model over decoding, exceeds what would be expected by chance or by the feature set's own biases.
minor comments (2)
  1. [Abstract] The abstract and results would benefit from explicit statement of the total number of texts analyzed and the distribution across genres/models/strategies.
  2. [Figures] Figure captions and axis labels in the PCA and clustering visualizations should include the percentage of variance explained by each component.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below. We agree that additional statistical details and explicit discussion of feature-set limitations will strengthen the paper.

point-by-point responses
  1. Referee: §3 (Feature Extraction and Analysis Pipeline): All headline results (robustness to prompts/continuations, genre > source, chat clustering, model > decoding) are obtained by projecting texts onto Biber's 67 features calibrated on human register variation. No ablation, external validation against LLM-specific markers (repetition, burstiness, formulaic n-gram overuse), or comparison to alternative feature sets is reported. This choice is load-bearing for the central claims; if primary human-LLM differences lie outside the Biber coordinate system, the reported relative effect sizes and robustness conclusions become artifacts of the chosen representation rather than intrinsic properties of the texts.

    Authors: We chose Biber's 67 lexicogrammatical features because they are an established, linguistically interpretable inventory for register and stylistic variation, directly supporting the paper's goal of providing transparent, human-readable insights into how LLM outputs align with or diverge from known human genre patterns. This is a deliberate methodological decision favoring interpretability over potentially higher detection power from LLM-specific markers. We acknowledge that features such as repetition or burstiness may capture additional dimensions not represented in the Biber space. In the revision we will add an explicit limitations subsection discussing this choice and its implications, along with suggestions for complementary feature sets in future work. A full ablation or external validation study lies outside the scope of the current revision. revision: partial

  2. Referee: Results section (variance partitioning and clustering): The manuscript reports directional findings on relative influence of factors but supplies no sample sizes per cell, statistical tests, effect sizes, or controls for prompt length. Without these, it is impossible to evaluate whether the claimed dominance of genre over source, or model over decoding, exceeds what would be expected by chance or by the feature set's own biases.

    Authors: We agree that the results section would benefit from greater statistical transparency. In the revised manuscript we will add: (1) a table reporting exact sample sizes per genre-model-decoding cell, (2) permutation-based or ANOVA-style tests with p-values for the variance-partitioning results to assess whether observed factor dominance exceeds chance, (3) effect-size measures (e.g., eta-squared or partial R²), and (4) clarification in the methods that all texts were length-controlled via truncation to a common maximum or inclusion of token length as a covariate. These additions will allow readers to evaluate the robustness of the reported relative influences. revision: yes
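A minimal version of the promised permutation test might look like the sketch below. The one-way eta-squared, the function names, and the toy feature/genre data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def eta_squared(values, groups):
    """One-way eta^2: between-group sum of squares over total sum of squares."""
    values, groups = np.asarray(values, float), np.asarray(groups)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        len(v := values[groups == g]) * (v.mean() - grand) ** 2
        for g in np.unique(groups)
    )
    return ss_between / ss_total

def permutation_pvalue(values, groups, n_perm=2000, seed=0):
    """Share of label shuffles whose eta^2 matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = eta_squared(values, groups)
    hits = sum(
        eta_squared(values, rng.permutation(np.asarray(groups))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Toy check: one stylistic feature measured across two genres.
feature = [0.00, 0.10, 0.05, 0.02, 0.08, 1.00, 1.10, 0.95, 1.02, 1.08]
genre = ["news"] * 5 + ["fiction"] * 5
p = permutation_pvalue(feature, genre)
```

The same machinery applies factor by factor (genre, source, model, decoding), which is what a variance-partitioning table with p-values would report.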

Circularity Check

0 steps flagged

No circularity in purely observational empirical analysis

full rationale

The paper performs a descriptive statistical comparison by extracting Douglas Biber's pre-existing lexicogrammatical and functional features from human and LLM texts, then applying PCA, clustering, and variance partitioning to the resulting feature vectors. No equations, fitted parameters, predictions, or derivations appear in the abstract or described methods; results are reported as direct empirical outcomes from the chosen coordinate system rather than quantities forced by construction or self-referential steps. The work is self-contained against external benchmarks because the feature inventory is independently established prior work and the analysis contains no load-bearing self-citations or ansatzes that reduce the central claims to tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the validity of a pre-existing linguistic feature inventory and the assumption that the chosen genres and models are representative; no new entities or fitted constants are introduced.

axioms (1)
  • domain assumption Douglas Biber's lexicogrammatical and functional features validly and comprehensively measure stylistic variation in both human and LLM text.
    The paper applies this established framework without re-deriving or validating its coverage for the new domain of LLM output.

pith-pipeline@v0.9.0 · 5541 in / 1278 out tokens · 30497 ms · 2026-05-10T12:44:51.092115+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 5 canonical work pages · 2 internal anchors

  1. Douglas Biber. Investigating macroscopic textual variation through multifeature/multidimensional analyses. Linguistics, 23(2):337–360; Douglas Biber. 1988. Variation Across Speech and Writing. Cambridge University Press; Douglas Biber. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press.
  2. Meta Llama 3. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2025-09-21.
  3. OpenAI. GPT-4 technical report. Preprint, arXiv:2303.08774; OpenAI. 2024a. GPT-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2025-09-21; OpenAI. 2024b. Hello GPT-4o.
  4. StyleDistance: Stronger content-independent style embeddings with synthetic parallel examples. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8662–8685, Albuquerque, New Mexico. Association for Computational Linguistics.
  5. A linguistic comparison between human and ChatGPT-generated conversations. Preprint, arXiv:2401.16587.
  6. Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. Release strategies and the social impacts of language models. CoRR, abs/1908.09203.
  7. Sergio E. Zanotto and Segun Aroyehun. 2025. Linguistic characteristics of AI-generated text: A survey. Preprint, arXiv:2510.05136.
  8. Sergio E. Zanotto and Segun Aroyehun. Human variability vs. machine consistency: A linguistic analysis of texts generated by humans and large language models. Preprint, arXiv:2412.03025.
  9. Linguistic and embedding-based profiling of texts generated by humans and large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22841–22858, Suzhou, China. Association for Computational Linguistics.
  10. Brook Zeleke, Amish Soni, and Lydia Manikonda. Human or GenAI? Characterizing the linguistic differences between human-written and LLM-generated text. In Companion Publication of the 17th ACM Web Science Conference 2025 (WebSci Companion '25), pages 34–37, New York, NY, USA. Association for Computing Machinery.
    Human or genai? characterizing the linguistic differ- ences between human-written and llm-generated text. InCompanion Publication of the 17th ACM Web Science Conference 2025, Websci Companion ’25, page 34–37, New York, NY , USA. Association for Computing Machinery. A Appendix This section provides additional visualizations that support and extend the resu...