Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

Shanu Sushmita; YeoJin Go; Yifan Hu; Zhanwei Cao

arxiv: 2604.12097 · v1 · submitted 2026-04-13 · 💻 cs.CL

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

keywords temporal flattening, LLM text generation, human vs LLM trajectories, semantic drift, cognitive-emotional variability, longitudinal text analysis, style evolution, text classification

The pith

LLMs exhibit reduced semantic and emotional drift over time compared to human writers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how human writing changes across years with how LLMs generate text over simulated periods. Humans display ongoing shifts in meaning, word use, and emotional tone in their documents. LLMs instead show more consistent patterns in these areas, even when given prior text as context. This leads to temporal flattening that allows variability measures to classify sources with 94 percent accuracy. The result matters for any application that needs LLMs to produce realistic evolving content over extended time.

Core claim

Using a dataset of 412 human authors and 6,086 documents from 2012 to 2024 across academic abstracts, blogs, and news, the authors generate matching trajectories from three LLMs under independent and history-conditioned settings. Drift and variance metrics applied to semantic, lexical, and cognitive-emotional representations show that LLMs produce greater lexical diversity but substantially lower semantic and cognitive-emotional drift than humans. The pattern holds across generation modes, and temporal variability alone distinguishes human from LLM trajectories at 94 percent accuracy and 98 percent ROC-AUC.
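To make the two headline numbers concrete, the sketch below shows how accuracy and ROC-AUC can be computed from one variability score per trajectory. The scores, labels, and the 0.5 cutoff are invented toy values, not the paper's data or its classifier. ROC-AUC is computed here as the probability that a randomly chosen human trajectory scores higher than a randomly chosen LLM trajectory, with ties counted as half.

```python
from itertools import product

def accuracy(scores, labels, threshold):
    """Fraction of trajectories whose thresholded score matches the label (1 = human)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_auc(scores, labels):
    """Probability that a positive outranks a negative; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical variability scores: higher means more change over time.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = human, 0 = LLM

toy_accuracy = accuracy(toy_scores, toy_labels, threshold=0.5)
toy_auc = roc_auc(toy_scores, toy_labels)
```

Accuracy depends on the chosen cutoff, while ROC-AUC summarizes ranking quality across all cutoffs, which is why the two figures can differ.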

What carries the argument

Temporal flattening, measured as reduced drift and variance across time in semantic, lexical, and cognitive-emotional representations of document sequences.
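This page does not give the paper's exact metric definitions, so the following is only a minimal sketch of one plausible reading. It assumes drift is the mean cosine distance between consecutive yearly centroid vectors and that variance is summarized by the coefficient of variation (CV) of a scalar feature across years. All function names and the toy vectors are illustrative.

```python
import math
from statistics import mean, pstdev

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*vectors)]

def mean_drift(yearly_vectors):
    """Average cosine distance between consecutive years' centroids.

    yearly_vectors maps year -> list of document vectors; needs >= 2 years.
    """
    years = sorted(yearly_vectors)
    cents = [centroid(yearly_vectors[y]) for y in years]
    return mean(cosine_distance(a, b) for a, b in zip(cents, cents[1:]))

def coefficient_of_variation(values):
    """Population standard deviation divided by |mean|; assumes a nonzero mean."""
    return pstdev(values) / abs(mean(values))

# Toy trajectories: the "human" one changes direction each year, the "LLM" one does not.
human_like = {2012: [[1.0, 0.0]], 2013: [[0.0, 1.0]], 2014: [[1.0, 0.0]]}
llm_like = {2012: [[1.0, 0.0]], 2013: [[1.0, 0.0]], 2014: [[1.0, 0.0]]}

human_drift = mean_drift(human_like)
llm_drift = mean_drift(llm_like)
```

Under this reading, a flattened trajectory shows up as a drift close to zero and a small CV on features such as sentiment or readability scores.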

Load-bearing premise

The chosen drift and variance metrics over semantic, lexical, and cognitive-emotional representations measure genuine temporal structure rather than artifacts of the embeddings or domain selection.

What would settle it

Demonstrating human-comparable semantic and cognitive-emotional drift levels in trajectories from future LLMs that incorporate explicit long-term memory mechanisms would show the flattening is not inherent.

Figures

Figures reproduced from arXiv: 2604.12097 by Shanu Sushmita, YeoJin Go, Yifan Hu, Zhanwei Cao.

**Figure 2.** Human vs. LLM drift differences. Left: SBERT (semantic); right: TF–IDF (lexical). Positive values ...

**Figure 3.** GPT-4o-mini Cog-Emo CV differences for personality features (Big Five proxies), comparing instance ...

**Figure 4.** Cog-Emo CV differences for sentiment features (polarity, subjectivity, VADER scores), comparing ...

**Figure 5.** Cog-Emo CV differences for stylistic features, Group 1 (lexical diversity, readability, length, POS ratios), ...

**Figure 6.** Cog-Emo CV differences for stylistic features, Group 2, comparing instance-wise and history-augmented ...

Original abstract

Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012–2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds LLMs flatten semantic and cognitive drift over time versus humans, yielding 94% separation from variability patterns, but the metrics risk picking up embedding artifacts instead of real temporal structure.

The core result is that LLMs generate text with less semantic and cognitive-emotional change across years than human authors do, even when fed prior history, and simple drift-plus-variance features from those trajectories separate the two at 94% accuracy and 98% ROC-AUC. They also release a 12-year, 412-author, three-domain corpus that looks like a practical resource for anyone tracking style evolution. That combination of new data and a clean stateless-versus-history-conditioned comparison is the useful part. The flattening holds across academic, blog, and news text, which gives the claim some breadth. The authors treat the gap as a property of current training and inference setups rather than a fixable prompt issue, and the numbers line up internally.

What is actually new is the longitudinal scale plus the predictive use of the variability signatures themselves. Prior work on style consistency or repetition mostly looked at short windows or single domains; this one tracks real multi-year human trajectories and shows the LLM versions stay flatter on the same axes. The dataset release and the persistence under history conditioning are the parts that could be cited or reused.

The soft spot is the metric layer. The separation depends on semantic, lexical, and cognitive-emotional representations whose drift definitions are not fully spelled out in the abstract, and the stress-test concern about embedding or domain-sampling artifacts is reasonable. If the chosen embeddings already cluster LLM outputs differently because of training-data distribution rather than because trajectories are flatter, the classifier could succeed without the claimed temporal mechanism being the driver. Domain balance across the 412 authors and any post-hoc threshold choices on the drift metrics would need explicit checks. The lexical-diversity increase alongside reduced semantic drift is interesting but could be an artifact of how the three representations were chosen.

This is worth a serious referee for groups building synthetic longitudinal data or studying long-context modeling. The dataset alone makes it refereeable even if the metric validation needs tightening. I would bring it to a reading group to see the full methods and ablations, but I would not cite the accuracy number until the artifact checks are in place.

Referee Report

2 major / 2 minor

Summary. The manuscript constructs and releases a longitudinal dataset of 412 human authors and 6,086 documents (2012-2024) across academic abstracts, blogs, and news. It generates matching trajectories from three LLMs under independent and history-conditioned settings, then applies drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations. The central finding is temporal flattening in LLM output: greater lexical diversity but substantially reduced semantic and cognitive-emotional drift relative to humans. These patterns alone yield 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories, persisting across generation conditions.

Significance. If the separation is robust to representation choices and domain controls, the work supplies quantitative evidence that current stateless or short-context LLM deployment paradigms cannot reproduce the longitudinal evolution characteristic of human writing. This has direct consequences for synthetic training data, author-style modeling, and longitudinal NLP tasks. The public dataset release is a clear strength that supports reproducibility and follow-on work.

major comments (2)

[Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.
[Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.

minor comments (2)

[Abstract] Abstract: The claim of persistence 'regardless of whether LLMs generate independently or with access to incremental history' should be accompanied by a brief quantitative statement of the accuracy drop (or lack thereof) under the history-conditioned condition.
[Dataset Construction] Dataset description: Provide a table or paragraph confirming the number of documents per domain and per author to allow readers to assess balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, as well as for recognizing the potential significance of the temporal flattening findings and the value of the released dataset. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

Referee: [Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.

Authors: We agree that additional robustness checks are necessary to rule out embedding-specific artifacts. The three representation families (semantic sentence embeddings, lexical TF-IDF, and cognitive-emotional lexicons) were chosen to align with established longitudinal text analysis practices, and the flattening pattern holds across all three. Nevertheless, to directly address this concern we will add (i) ablations using alternative embeddings (e.g., different sentence-transformer variants and static word embeddings) and (ii) explicit domain-balanced subsampling experiments that enforce equal representation of academic, blog, and news documents per author. These results will be reported in a new subsection of the Methods and Results sections.

revision: yes

Referee: [Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.

Authors: The drift and variance thresholds were computed from the empirical distribution of human trajectories on a held-out training partition and then applied to a disjoint test partition; no hyperparameter search was performed on the test data. To eliminate any remaining ambiguity about post-hoc selection, we will replace the single-threshold classifier with a nested cross-validation procedure (outer loop for evaluation, inner loop for threshold selection) and report the resulting accuracy and ROC-AUC with confidence intervals. We will also add a statement clarifying the train/test split protocol in the revised Results section.

revision: yes
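The nested cross-validation procedure named in the second response (outer loop for evaluation, inner loop for threshold selection) might look roughly like the sketch below. It assumes a single scalar variability score per trajectory and a fixed grid of candidate thresholds; the fold scheme, function names, and toy data are illustrative and are not taken from the paper.

```python
def accuracy(scores, labels, threshold):
    """Share of items where (score >= threshold) agrees with the label (1 = human)."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)

def folds(n, k):
    """Deterministic interleaved split of indices 0..n-1 into k folds."""
    return [list(range(i, n, k)) for i in range(k)]

def pick_threshold(scores, labels, candidates, k_inner):
    """Inner loop: return the candidate with the best mean accuracy over inner folds."""
    def inner_score(t):
        accs = [accuracy([scores[i] for i in f], [labels[i] for i in f], t)
                for f in folds(len(scores), k_inner)]
        return sum(accs) / len(accs)
    return max(candidates, key=inner_score)

def nested_cv_accuracy(scores, labels, candidates, k_outer=4, k_inner=3):
    """Outer loop: evaluate on each held-out fold with a threshold chosen without it."""
    outer_accs = []
    for test_idx in folds(len(scores), k_outer):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(scores)) if i not in held_out]
        t = pick_threshold([scores[i] for i in train_idx],
                           [labels[i] for i in train_idx], candidates, k_inner)
        outer_accs.append(accuracy([scores[i] for i in test_idx],
                                   [labels[i] for i in test_idx], t))
    return sum(outer_accs) / len(outer_accs)

# Toy, cleanly separable scores (invented): even positions human, odd positions LLM.
toy_scores = [0.8, 0.1, 0.9, 0.2, 0.7, 0.3, 0.85, 0.15, 0.75, 0.25, 0.95, 0.05]
toy_labels = [1, 0] * 6
toy_result = nested_cv_accuracy(toy_scores, toy_labels, candidates=[0.1, 0.5, 0.9])
```

The point of the nesting is that the threshold applied to each outer test fold is chosen without looking at that fold, so the reported accuracy cannot be inflated by post-hoc tuning.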

Circularity Check

0 steps flagged

No circularity: empirical metric application on independent trajectories

The paper constructs an external longitudinal dataset of human authors and generates LLM trajectories under explicit settings, then applies standard drift and variance metrics over semantic/lexical/cognitive representations to measure differences. The reported 94% accuracy and 98% ROC-AUC are direct empirical outcomes of a downstream classifier on these computed features, with no equations or steps that reduce the distinction to a fitted parameter by construction, self-definition of the target quantity, or load-bearing self-citation. The derivation chain remains self-contained against the released data and generation protocols.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard assumptions about what constitutes temporal evolution in text and on the validity of off-the-shelf embedding models for semantic and emotional drift; no new entities are postulated and free parameters appear limited to metric implementation details not detailed in the abstract.

free parameters (1)

drift and variance thresholds
Parameters used to quantify drift and variance in the chosen representations; exact values or fitting procedures not specified in abstract.

axioms (1)

domain assumption: Human writing trajectories exhibit measurable longitudinal evolution in semantic, lexical, and cognitive-emotional dimensions.
Foundational premise for constructing the comparison and interpreting reduced drift as flattening.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1] Zefang Liu and Yinzhu Quan

Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics. Baixiang Huang, Canyu Chen, and Kai Shu. 2024. Can large language models identify author...

[2] Hong Kong: Longitudinal and synchronic characterisations of protest news between 1998 and

[3] European Language Resources Association

In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2891–2900, Marseille, France. European Language Resources Association. Alessio Miaschi, Sam Davidson, Dominique Brunato, Felice Dell’Orletta, Kenji Sagae, Claudia Helena Sanchez-Gutierrez, and Giulia Venturi. 2020. Tracking the evolution of written language competence...

[4] Ajay Patel, Colin Raffel, and Chris Callison-Burch

Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6):934–952. Ajay Patel, Colin Raffel, and Chris Callison-Burch

[5] Llama 2: Open Foundation and Fine-Tuned Chat Models

DataDreamer: A tool for synthetic data generation and reproducible LLM workflows. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3781–3799, Bangkok, Thailand. Association for Computational Linguistics. Guillermo Ríos-Toledo, Juan P. F. Posadas-Durán, Grigori Sidorov, and N...

[6] Remove multi-author bylines and organization names

[7] Retain only authors with at least three consecutive years of publication (2012–2022)

[8] Randomly sample up to five articles per author per year

[9] Re-scrape each URL via requests + BeautifulSoup to obtain full text

[10] The collected dataset includes 3,688 full-text articles by 217 unique journalists, each with a continuous 5–11-year publication streak

Exclude incomplete or very short articles (<70 words). The collected dataset includes 3,688 full-text articles by 217 unique journalists, each with a continuous 5–11-year publication streak. After preprocessing (length filtering, boilerplate removal, metadata validation), 3,685 news articles remain. Among these, 117 journalists satisfy the longitudi...

[11] Extraction: Extract keywords, summary, and word count from each human document

[12] Template population: Insert extracted fields into configuration-specific templates

[13] Example sources. In the persona + examples configuration, style examples are not sampled from the human corpus

Generation: Query LLM API to produce synthetic text. Example sources. In the persona + examples configuration, style examples are not sampled from the human corpus. Instead, a fixed set of manually designed, genre-specific snippets is used (implemented in get_examples_by_genre). These generic examples are shared across all authors within a genre and con...

[14] Generate: For each document in year t, produce LLM text using the base prompt prepended with cumulative history from years < t

[15] Aggregate: After all texts for year t are generated, compress them into a single ≤80-word summary

[16] shadow authors

Accumulate: Append the year-t summary to the history block for use in year t+1. Steps 1–3 repeat for each subsequent year, building an incrementally growing history per author trajectory. This design mirrors a realistic scenario in which an LLM has access to a compressed record of its own prior outputs, testing whether such context enables more human-...
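The Generate, Aggregate, Accumulate loop described in entries [14] to [16] can be sketched as follows. This is only an illustration of the control flow under stated assumptions: `fake_llm` is a hypothetical stand-in for a real LLM API call, and `summarize` simply truncates to 80 words, whereas the paper's actual compression step is not specified in the extracted text.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for an LLM API call; returns a deterministic reply."""
    return f"text written with {len(prompt.split())} prompt words"

def summarize(texts, max_words=80):
    """Stand-in for year-level compression: join the texts and keep <= max_words words."""
    words = " ".join(texts).split()
    return " ".join(words[:max_words])

def generate_trajectory(docs_by_year, base_prompt, llm=fake_llm):
    """Run history-conditioned generation over one author's trajectory.

    docs_by_year maps year -> list of per-document prompt fields (strings).
    Returns (generated texts per year, list of yearly history entries).
    """
    history = []   # one compressed entry per completed year
    outputs = {}
    for year in sorted(docs_by_year):
        history_block = "\n".join(history)  # only years < current year
        year_texts = []
        for doc_fields in docs_by_year[year]:
            # Generate: base prompt prepended with the cumulative history.
            prompt = f"{history_block}\n{base_prompt} {doc_fields}".strip()
            year_texts.append(llm(prompt))
        outputs[year] = year_texts
        # Aggregate: compress this year's texts into one short summary.
        summary = summarize(year_texts)
        # Accumulate: make the summary available to the following year.
        history.append(f"{year}: {summary}")
    return outputs, history

toy_docs = {2012: ["topic a", "topic b"], 2013: ["topic c"]}
toy_outputs, toy_history = generate_trajectory(toy_docs, "Write about")
```

Because each year's prompt contains only summaries from earlier years, the history grows by one short entry per year instead of by the full generated text.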
