Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories
Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3
The pith
LLMs exhibit reduced semantic and emotional drift over time compared to human writers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a dataset of 412 human authors and 6,086 documents from 2012 to 2024 across academic abstracts, blogs, and news, the authors generate matching trajectories from three LLMs under independent and history-conditioned settings. Drift and variance metrics applied to semantic, lexical, and cognitive-emotional representations show that LLMs produce greater lexical diversity but substantially lower semantic and cognitive-emotional drift than humans. The pattern holds across generation modes, and temporal variability alone distinguishes human from LLM trajectories at 94 percent accuracy and 98 percent ROC-AUC.
What carries the argument
Temporal flattening, measured as reduced drift and variance across time in semantic, lexical, and cognitive-emotional representations of document sequences.
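To make "drift" and "variance" concrete, here is a minimal sketch of one way such quantities could be computed over a sequence of per-year vectors. The definitions are assumptions for illustration (drift as mean cosine distance between consecutive steps, variance as per-dimension population variance averaged over dimensions); the paper's exact formulas are not given in this summary, and the function names and toy vectors are hypothetical.

```python
import math
from statistics import pvariance


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_drift(trajectory):
    """Average cosine distance between consecutive time steps (assumed definition)."""
    if len(trajectory) < 2:
        raise ValueError("need at least two time steps")
    steps = [cosine_distance(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)


def mean_variance(trajectory):
    """Population variance per dimension, averaged over dimensions (assumed definition)."""
    dimensions = list(zip(*trajectory))
    return sum(pvariance(d) for d in dimensions) / len(dimensions)


# Toy per-year vectors (hypothetical): one trajectory that moves, one that does not.
moving = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(mean_drift(moving), mean_drift(flat))
print(mean_variance(moving), mean_variance(flat))
```

Under these assumed definitions, a flattened trajectory scores lower on both quantities than one whose yearly vectors change direction.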
Load-bearing premise
The chosen drift and variance metrics over semantic, lexical, and cognitive-emotional representations measure genuine temporal structure rather than artifacts of the embeddings or domain selection.
What would settle it
Demonstrating human-comparable semantic and cognitive-emotional drift levels in trajectories from future LLMs that incorporate explicit long-term memory mechanisms would show the flattening is not inherent.
Original abstract
Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012–2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs and releases a longitudinal dataset of 412 human authors and 6,086 documents (2012-2024) across academic abstracts, blogs, and news. It generates matching trajectories from three LLMs under independent and history-conditioned settings, then applies drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations. The central finding is temporal flattening in LLM output: greater lexical diversity but substantially reduced semantic and cognitive-emotional drift relative to humans. These patterns alone yield 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories, persisting across generation conditions.
Significance. If the separation is robust to representation choices and domain controls, the work supplies quantitative evidence that current stateless or short-context LLM deployment paradigms cannot reproduce the longitudinal evolution characteristic of human writing. This has direct consequences for synthetic training data, author-style modeling, and longitudinal NLP tasks. The public dataset release is a clear strength that supports reproducibility and follow-on work.
major comments (2)
- [Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.
- [Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.
minor comments (2)
- [Abstract] Abstract: The claim of persistence 'regardless of whether LLMs generate independently or with access to incremental history' should be accompanied by a brief quantitative statement of the accuracy drop (or lack thereof) under the history-conditioned setting.
- [Dataset Construction] Dataset description: Provide a table or paragraph confirming the number of documents per domain and per author to allow readers to assess balance.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review, as well as for recognizing the potential significance of the temporal flattening findings and the value of the released dataset. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.
- Referee: [Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.
Authors: We agree that additional robustness checks are necessary to rule out embedding-specific artifacts. The three representation families (semantic sentence embeddings, lexical TF-IDF, and cognitive-emotional lexicons) were chosen to align with established longitudinal text analysis practices, and the flattening pattern holds across all three. Nevertheless, to directly address this concern we will add (i) ablations using alternative embeddings (e.g., different sentence-transformer variants and static word embeddings) and (ii) explicit domain-balanced subsampling experiments that enforce equal representation of academic, blog, and news documents per author. These results will be reported in new subsections of the Methods and Results sections.
Revision: yes
- Referee: [Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.
Authors: The drift and variance thresholds were computed from the empirical distribution of human trajectories on a training partition and then applied to a disjoint test partition; no hyperparameter search was performed on the test data. To eliminate any remaining ambiguity about post-hoc selection, we will replace the single-threshold classifier with a nested cross-validation procedure (outer loop for evaluation, inner loop for threshold selection) and report the resulting accuracy and ROC-AUC with confidence intervals. We will also add a statement clarifying the train/test split protocol in the revised Results section.
Revision: yes
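The nested procedure described in this response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: each trajectory is reduced to a single drift score, label 1 means human and is predicted when the score is at or above the threshold, folds are interleaved rather than shuffled, and all function names are hypothetical. Because a single threshold has no separate fitting step, the inner loop only scores candidate thresholds on inner validation folds.

```python
def accuracy(scores, labels, threshold):
    """Fraction correct when score >= threshold is predicted human (label 1)."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(scores)


def folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]


def candidate_thresholds(scores):
    """Midpoints between consecutive distinct scores."""
    distinct = sorted(set(scores))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def select_threshold(scores, labels, train_idx, inner_k):
    """Inner loop: pick the candidate with the best mean validation accuracy."""
    candidates = candidate_thresholds([scores[i] for i in train_idx])
    if not candidates:
        raise ValueError("need at least two distinct training scores")
    best_t, best_acc = None, -1.0
    for t in candidates:
        accs = [
            accuracy([scores[i] for i in val], [labels[i] for i in val], t)
            for val in folds(train_idx, inner_k)
            if val
        ]
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_t, best_acc = t, mean_acc
    return best_t


def nested_cv_accuracy(scores, labels, outer_k=5, inner_k=3):
    """Outer loop: evaluate each selected threshold on data unseen during selection."""
    all_idx = list(range(len(scores)))
    outer_accs = []
    for test_idx in folds(all_idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        t = select_threshold(scores, labels, train_idx, inner_k)
        outer_accs.append(
            accuracy([scores[i] for i in test_idx], [labels[i] for i in test_idx], t)
        )
    return sum(outer_accs) / len(outer_accs)
```

The point of the outer loop is that the reported accuracy never comes from trajectories that influenced the threshold choice, which is what the referee's concern about post-hoc tuning requires.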
Circularity Check
No circularity: empirical metric application on independent trajectories
The paper constructs an external longitudinal dataset of human authors and generates LLM trajectories under explicit settings, then applies standard drift and variance metrics over semantic/lexical/cognitive representations to measure differences. The reported 94% accuracy and 98% ROC-AUC are direct empirical outcomes of a downstream classifier on these computed features, with no equations or steps that reduce the distinction to a fitted parameter by construction, self-definition of the target quantity, or load-bearing self-citation. The derivation chain remains self-contained against the released data and generation protocols.
Axiom & Free-Parameter Ledger
free parameters (1)
- drift and variance thresholds
axioms (1)
- Domain assumption: Human writing trajectories exhibit measurable longitudinal evolution in semantic, lexical, and cognitive-emotional dimensions.
Reference graph
Works this paper leans on
[6]
Remove multi-author bylines and organiza- tion names
-
[7]
Retain only authors with at least three consec- utive years of publication (2012–2022)
work page 2012
-
[8]
Randomly sample up to five articles per author per year
-
[9]
Re-scrape each URL via requests + BeautifulSoupto obtain full text
-
[10]
Exclude incomplete or very short articles (<70 words). The collected dataset includes 3,688 full-text arti- cles by 217 unique journalists, each with a continu- ous 5–11-year publication streak. After preprocess- ing (length filtering, boilerplate removal, metadata validation), 3,685 news articles remain. Among these, 117 journalists satisfy the longitudi...
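The author-level filtering steps above can be sketched as follows. This is a rough illustration that assumes each article is a dict with hypothetical `author`, `year`, and `text` keys; byline cleaning and re-scraping are omitted, and the function names, constants, and seed handling are not from the paper.

```python
import random
from collections import defaultdict

MIN_WORDS = 70        # articles under 70 words are excluded
MAX_PER_YEAR = 5      # sample up to five articles per author per year
MIN_STREAK = 3        # at least three consecutive years of publication
YEAR_RANGE = (2012, 2022)


def longest_consecutive_run(years):
    """Length of the longest run of consecutive calendar years."""
    ys = sorted(set(years))
    best = run = 1 if ys else 0
    for prev, cur in zip(ys, ys[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best


def filter_corpus(articles, seed=0):
    """Apply the streak, sampling, and length filters in the order listed."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_author = defaultdict(lambda: defaultdict(list))
    for art in articles:
        if YEAR_RANGE[0] <= art["year"] <= YEAR_RANGE[1]:
            by_author[art["author"]][art["year"]].append(art)

    kept = []
    for author, by_year in by_author.items():
        if longest_consecutive_run(by_year) < MIN_STREAK:
            continue  # author lacks a long enough publication streak
        for year in sorted(by_year):
            docs = by_year[year]
            sample = rng.sample(docs, min(MAX_PER_YEAR, len(docs)))
            kept.extend(d for d in sample if len(d["text"].split()) >= MIN_WORDS)
    return kept
```

Note that filtering by length after sampling can leave an author with fewer usable years than the streak check saw, so a second streak check on the output may be needed in practice.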
- [11] Extraction: Extract keywords, summary, and word count from each human document.
- [12] Template population: Insert extracted fields into configuration-specific templates.
- [13] Generation: Query the LLM API to produce synthetic text.
Example sources. In the persona + examples configuration, style examples are not sampled from the human corpus. Instead, a fixed set of manually designed, genre-specific snippets is used (implemented in get_examples_by_genre). These generic examples are shared across all authors within a genre and con...
- [14] Generate: For each document in year t, produce LLM text using the base prompt prepended with cumulative history from years < t.
- [15] Aggregate: After all texts for year t are generated, compress them into a single ≤80-word summary.
- [16] Accumulate: Append the year-t summary to the history block for use in year t+1.
Steps 1–3 (Generate, Aggregate, Accumulate) repeat for each subsequent year, building an incrementally growing history per author trajectory. This design mirrors a realistic scenario in which an LLM has access to a compressed record of its own prior outputs, testing whether such context enables more human-...
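The Generate, Aggregate, Accumulate loop can be sketched as below. This is a minimal illustration, not the paper's code: `generate` and `summarize` are hypothetical callables standing in for LLM API calls, the history entry format is invented, and the word truncation is an assumed way to enforce the 80-word cap.

```python
def truncate_words(text, limit=80):
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def generate_trajectory(prompts_by_year, generate, summarize, max_summary_words=80):
    """Produce one author's texts year by year with a growing history block.

    prompts_by_year: dict mapping year -> list of base prompts (one per document).
    generate:        callable(prompt) -> text, a stand-in for an LLM call.
    summarize:       callable(list_of_texts) -> summary, also a stand-in.
    """
    history = []   # one compressed summary per completed year
    outputs = {}
    for year in sorted(prompts_by_year):
        history_block = "\n".join(history)

        # Generate: prepend the cumulative history from earlier years.
        texts = []
        for base_prompt in prompts_by_year[year]:
            prompt = f"{history_block}\n\n{base_prompt}" if history_block else base_prompt
            texts.append(generate(prompt))
        outputs[year] = texts

        # Aggregate: compress this year's texts into a short summary.
        summary = truncate_words(summarize(texts), max_summary_words)

        # Accumulate: make the summary available to the following year.
        history.append(f"{year}: {summary}")
    return outputs
```

Passing the model calls in as arguments keeps the loop testable with stub functions and independent of any particular LLM provider.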