Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies
Pith reviewed 2026-05-10 12:44 UTC · model grok-4.3
The pith
Stylistic markers that separate LLM text from human writing stay stable even under prompts and decoding meant to make it sound human.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using Douglas Biber's lexicogrammatical and functional features, the analysis shows that the key differentiators between human and LLM text are robust to variation in generation conditions, such as prompts that ask for human-like output or human-written text supplied for continuation. Genre exerts a stronger effect on stylistic features than the source (human vs. LLM), chat model variants cluster together stylistically, and the choice of model shapes style more than the decoding strategy does, with some exceptions.
What carries the argument
Douglas Biber's set of lexicogrammatical and functional features, which counts specific linguistic structures to place texts in a multidimensional stylistic space.
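As a concrete illustration of what such a feature is, the sketch below counts two Biber-style markers as rates per 1,000 tokens. The regex tokenizer and the two features are deliberate simplifications for exposition; the paper's actual 67-feature inventory requires full grammatical tagging. Stacking such rates for every text yields the feature vectors the rest of the analysis operates on.

```python
# Illustrative only: two Biber-style features as per-1,000-token rates.
# The real 67-feature inventory needs POS tagging; this is a crude
# surface approximation for exposition, not the paper's tagger.
import re

def biber_style_rates(text: str) -> dict:
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)
    first_person = sum(t in {"i", "me", "my", "we", "us", "our"} for t in tokens)
    # Rough nominalization proxy via derivational suffixes Biber tracks.
    nominalizations = sum(t.endswith(("tion", "ment", "ness", "ity")) for t in tokens)
    return {
        "first_person_per_1k": 1000 * first_person / n,
        "nominalizations_per_1k": 1000 * nominalizations / n,
    }

print(biber_style_rates("We argue that the measurement of variation matters."))
```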
If this is right
- Stylistic detection methods relying on these features may continue to work even when users optimize prompts for naturalness.
- Genre-specific adaptation of models could prove more effective for controlling output style than general human-imitation prompts.
- Chat variants produce stylistically similar outputs to one another even when their base models differ.
- Changes in decoding parameters produce smaller stylistic shifts than switching between different models.
Where Pith is reading between the lines
- Detection tools built on these features might generalize more reliably to everyday use cases than approaches sensitive to prompting tricks.
- Extending the analysis to include semantic embeddings could test whether genre dominance holds in meaning-related dimensions of text.
- Writers using LLMs for varied tasks may benefit more from selecting the right model than from adjusting generation settings.
Load-bearing premise
That Douglas Biber's lexicogrammatical and functional features capture, adequately and without bias, the stylistically relevant dimensions of variation between human and LLM text across the sampled genres and conditions.
What would settle it
If a different feature set, focused on semantic or discourse patterns, showed large shifts in human-LLM differences under human-mimicking prompts or continuations, that would indicate the reported robustness is limited to this particular measurement approach.
original abstract
Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber's set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a large-scale observational study projecting human-written and LLM-generated texts (11 models, 8 genres, 4 decoding strategies) onto Douglas Biber's 67 lexicogrammatical and functional features, then applies PCA, clustering, and variance partitioning. It concludes that key stylistic differentiators are robust to prompting/continuation conditions, genre exerts stronger influence than source (human vs. LLM), chat-model variants cluster together, and model identity has larger effect than decoding strategy.
Significance. If the Biber feature inventory adequately spans the relevant stylistic dimensions, the scale of the comparison across genres, models, and strategies supplies useful guidance on which factors most shape LLM output style. The explicit use of an established, interpretable feature set (rather than black-box embeddings) is a methodological strength that supports the paper's claim to interpretability.
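For concreteness, here is a minimal sketch of the pipeline the summary describes, with a synthetic placeholder matrix standing in for the real texts-by-67-features counts (this is not the authors' code): standardize, project with PCA, and read off the variance each component explains, which is also the quantity the second minor comment below asks to see in figure captions.

```python
# Minimal pipeline sketch: standardize a texts-by-features matrix and
# project it with PCA. The Poisson draws are a synthetic stand-in for
# the 67 Biber feature counts; this is not the authors' data or code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(200, 67)).astype(float)  # 200 texts, 67 features

X_std = StandardScaler().fit_transform(X)  # z-score each feature
pca = PCA(n_components=5).fit(X_std)
coords = pca.transform(X_std)              # per-text coordinates in stylistic space

for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.1%} of variance explained")
```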
major comments (2)
- [§3] §3 (Feature Extraction and Analysis Pipeline): All headline results (robustness to prompts/continuations, genre > source, chat clustering, model > decoding) are obtained by projecting texts onto Biber's 67 features calibrated on human register variation. No ablation, external validation against LLM-specific markers (repetition, burstiness, formulaic n-gram overuse; a sketch of such markers follows these comments), or comparison to alternative feature sets is reported. This choice is load-bearing for the central claims; if the primary human-LLM differences lie outside the Biber coordinate system, the reported relative effect sizes and robustness conclusions become artifacts of the chosen representation rather than intrinsic properties of the texts.
- [Results] Results section (variance partitioning and clustering): The manuscript reports directional findings on relative influence of factors but supplies no sample sizes per cell, statistical tests, effect sizes, or controls for prompt length. Without these, it is impossible to evaluate whether the claimed dominance of genre over source, or model over decoding, exceeds what would be expected by chance or by the feature set's own biases.
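To make the first comment's alternative concrete, below is a hedged sketch of three LLM-specific markers that fall outside the Biber inventory. The definitions chosen here (type-token ratio for repetition, a Goh-Barabási-style burstiness coefficient over sentence lengths, and top-bigram share for formulaic overuse) are common proxies picked for illustration, not metrics the referee or the paper prescribes.

```python
# Illustrative proxies for LLM-specific markers outside the Biber set:
# repetition, burstiness, and formulaic n-gram overuse. Definitions are
# common conventions chosen for this sketch, not prescribed metrics.
import re
from collections import Counter

def llm_markers(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences] or [0]
    mean = sum(lengths) / len(lengths)
    sd = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
    bigrams = Counter(zip(tokens, tokens[1:]))
    top_count = bigrams.most_common(1)[0][1] if bigrams else 0
    return {
        "type_token_ratio": len(set(tokens)) / n,  # low => repetitive vocabulary
        # Burstiness in (-1, 1): strongly negative => very regular sentence lengths.
        "burstiness": (sd - mean) / (sd + mean) if (sd + mean) else 0.0,
        "top_bigram_share": top_count / max(len(tokens) - 1, 1),  # formulaic overuse
    }

print(llm_markers("The model repeats itself. The model repeats itself often."))
```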
minor comments (2)
- [Abstract] The abstract and results would benefit from explicit statement of the total number of texts analyzed and the distribution across genres/models/strategies.
- [Figures] Figure captions and axis labels in the PCA and clustering visualizations should include the percentage of variance explained by each component.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below. We agree that additional statistical details and explicit discussion of feature-set limitations will strengthen the paper.
point-by-point responses
Referee: §3 (Feature Extraction and Analysis Pipeline): All headline results (robustness to prompts/continuations, genre > source, chat clustering, model > decoding) are obtained by projecting texts onto Biber's 67 features calibrated on human register variation. No ablation, external validation against LLM-specific markers (repetition, burstiness, formulaic n-gram overuse), or comparison to alternative feature sets is reported. This choice is load-bearing for the central claims; if primary human-LLM differences lie outside the Biber coordinate system, the reported relative effect sizes and robustness conclusions become artifacts of the chosen representation rather than intrinsic properties of the texts.
Authors: We chose Biber's 67 lexicogrammatical features because they are an established, linguistically interpretable inventory for register and stylistic variation, directly supporting the paper's goal of providing transparent, human-readable insights into how LLM outputs align with or diverge from known human genre patterns. This is a deliberate methodological decision favoring interpretability over potentially higher detection power from LLM-specific markers. We acknowledge that features such as repetition or burstiness may capture additional dimensions not represented in the Biber space. In the revision we will add an explicit limitations subsection discussing this choice and its implications, along with suggestions for complementary feature sets in future work. A full ablation or external validation study lies outside the scope of the current revision. revision: partial
Referee: Results section (variance partitioning and clustering): The manuscript reports directional findings on relative influence of factors but supplies no sample sizes per cell, statistical tests, effect sizes, or controls for prompt length. Without these, it is impossible to evaluate whether the claimed dominance of genre over source, or model over decoding, exceeds what would be expected by chance or by the feature set's own biases.
Authors: We agree that the results section would benefit from greater statistical transparency. In the revised manuscript we will add: (1) a table reporting exact sample sizes per genre-model-decoding cell, (2) permutation-based or ANOVA-style tests with p-values for the variance-partitioning results to assess whether the observed factor dominance exceeds chance, (3) effect-size measures (e.g., eta-squared or partial R²), and (4) a methods clarification of how text length was controlled, whether by truncation to a common maximum or by including token length as a covariate. These additions will allow readers to evaluate the robustness of the reported relative influences. revision: yes
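A hedged sketch of what items (2) and (3) could look like in practice: eta-squared for one factor on one feature, plus a permutation test of whether the observed value exceeds chance. The synthetic data, the specific statistic, and the permutation scheme are illustrative assumptions, not the revision's actual procedure.

```python
# Sketch of eta-squared plus a permutation test for factor dominance.
# Synthetic data stands in for one Biber feature across 8 genres; the
# revised manuscript's exact procedure may differ.
import numpy as np

def eta_squared(values: np.ndarray, groups: np.ndarray) -> float:
    """Share of total variance explained by group means (SS_between / SS_total)."""
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        values[groups == g].size * (values[groups == g].mean() - grand) ** 2
        for g in np.unique(groups)
    )
    return ss_between / ss_total

rng = np.random.default_rng(1)
genre = rng.integers(0, 8, size=400)           # 8 genres, 400 texts
feature = rng.normal(size=400) + 0.5 * genre   # genre shifts the feature value

observed = eta_squared(feature, genre)
# Null distribution: shuffle genre labels and recompute eta-squared.
null = [eta_squared(feature, rng.permutation(genre)) for _ in range(1000)]
p_value = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(f"eta^2 = {observed:.3f}, permutation p = {p_value:.4f}")
```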
Circularity Check
No circularity in purely observational empirical analysis
full rationale
The paper performs a descriptive statistical comparison: it extracts Douglas Biber's pre-existing lexicogrammatical and functional features from human and LLM texts, then applies PCA, clustering, and variance partitioning to the resulting feature vectors. No equations, fitted parameters, predictions, or derivations appear in the abstract or described methods; results are reported as direct empirical outcomes in the chosen coordinate system rather than quantities forced by construction or by self-referential steps. The analysis also does not validate itself against its own outputs: the feature inventory is independently established prior work, and the methods contain no load-bearing self-citations or ansatzes that would reduce the central claims to tautologies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Douglas Biber's lexicogrammatical and functional features validly and comprehensively measure stylistic variation in both human and LLM text.
Reference graph
Works this paper leans on
- [1] Douglas Biber. 1985. Investigating macroscopic textual variation through multifeature/multidimensional analyses. Linguistics, 23(2):337–360. Douglas Biber. 1988. Variation Across Speech and Writing. Cambridge University Press. Douglas Biber. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press.
- [2] Meta AI. Meta Llama 3. https://ai.meta.com/blog/meta-llama-3/. Accessed 2025-09-21.
- [3] OpenAI. 2023. GPT-4 technical report. Preprint, arXiv:2303.08774. OpenAI. 2024a. GPT-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed 2025-09-21. OpenAI. 2024b. Hello GPT-4o. https://openai.com/inde...
- [4] StyleDistance: Stronger content-independent style embeddings with synthetic parallel examples. 2025. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8662–8685, Albuquerque, New Mexico. Association for Computational Linguistics.
- [5] A linguistic comparison between human and ChatGPT-generated conversations. 2024. Preprint, arXiv:2401.16587.
- [6] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. 2019. Release strategies and the social impacts of language models. CoRR, abs/1908.09203.
- [7] Linguistic characteristics of AI-generated text: A survey. 2025. Preprint, arXiv:2510.05136.
- [8] Sergio E. Zanotto and Segun Aroyehun. 2024. Human variability vs. machine consistency: A linguistic analysis of texts generated by humans and large language models. Preprint, arXiv:2412.03025.
- [9] Sergio E. Zanotto and Segun Aroyehun. 2025. Linguistic and embedding-based profiling of texts generated by humans and large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22841–22858, Suzhou, China. Association for Computational Linguistics.
- [10] Brook Zeleke, Amish Soni, and Lydia Manikonda. 2025. Human or GenAI? Characterizing the linguistic differences between human-written and LLM-generated text. In Companion Publication of the 17th ACM Web Science Conference 2025 (WebSci Companion '25), pages 34–37, New York, NY, USA. Association for Computing Machinery.
discussion (0)