arxiv: 2604.12097 · v1 · submitted 2026-04-13 · 💻 cs.CL

Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords: temporal flattening · LLM text generation · human vs LLM trajectories · semantic drift · cognitive-emotional variability · longitudinal text analysis · style evolution · text classification

The pith

LLMs exhibit reduced semantic and emotional drift over time compared to human writers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how human writing changes across years with how LLM output changes across matching simulated periods. Human documents show ongoing shifts in meaning, word use, and emotional tone. LLM output is more consistent on these dimensions, even when the model is given prior text as context. The authors call this temporal flattening, and it is distinctive enough that variability measures alone separate human from LLM trajectories with 94 percent accuracy. The result matters for any application that needs LLMs to produce realistically evolving content over extended periods.

Core claim

Using a dataset of 412 human authors and 6,086 documents from 2012 to 2024 across academic abstracts, blogs, and news, the authors generate matching trajectories from three LLMs under independent and history-conditioned settings. Drift and variance metrics applied to semantic, lexical, and cognitive-emotional representations show that LLMs produce greater lexical diversity but substantially lower semantic and cognitive-emotional drift than humans. The pattern holds across generation modes, and temporal variability alone distinguishes human from LLM trajectories at 94 percent accuracy and 98 percent ROC-AUC.

What carries the argument

Temporal flattening, measured as reduced drift and variance across time in semantic, lexical, and cognitive-emotional representations of document sequences.
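
As a rough illustration of what such measures could look like, the sketch below computes a mean step-to-step drift and a coefficient of variation (CV) for one trajectory. This page does not give the paper's formulas, so the definitions used here (cosine distance between consecutive period vectors, population standard deviation over absolute mean) and the names `mean_drift` and `coefficient_of_variation` are assumptions, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_drift(trajectory):
    """Average cosine distance between consecutive period vectors (assumed definition)."""
    if len(trajectory) < 2:
        raise ValueError("need at least two periods to measure drift")
    steps = [cosine_distance(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return mean(steps)


def coefficient_of_variation(values):
    """Population standard deviation divided by |mean| of one scalar feature."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(m)


# Toy trajectories: one vector per period (hypothetical data, not from the paper).
changing = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # topic flips every period
flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # no movement at all

print("drift (changing):", mean_drift(changing))
print("drift (flat):", mean_drift(flat))
print("CV of a scalar feature:", coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]))
```

Under these assumed definitions, a flattened trajectory shows up as a drift near zero and a small CV on per-period features such as sentiment scores.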

Load-bearing premise

The chosen drift and variance metrics over semantic, lexical, and cognitive-emotional representations measure genuine temporal structure rather than artifacts of the embeddings or domain selection.

What would settle it

Demonstrating human-comparable semantic and cognitive-emotional drift levels in trajectories from future LLMs that incorporate explicit long-term memory mechanisms would show the flattening is not inherent.

Figures

Figures reproduced from arXiv: 2604.12097 by Shanu Sushmita, YeoJin Go, Yifan Hu, Zhanwei Cao.

Figure 1: Human vs. LLM trajectory generation under ...
Figure 2: Human vs. LLM drift differences. Left: SBERT (semantic); right: TF–IDF (lexical). Positive values ...
Figure 3: GPT-4o-mini Cog-Emo CV differences for personality features (Big Five proxies), comparing instance ...
Figure 4: Cog-Emo CV differences for sentiment features (polarity, subjectivity, VADER scores), comparing ...
Figure 5: Cog-Emo CV differences for stylistic features, Group 1 (lexical diversity, readability, length, POS ratios), ...
Figure 6: Cog-Emo CV differences for stylistic features, Group 2, comparing instance-wise and history-augmented ...

Original abstract

Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors' styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012--2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript constructs and releases a longitudinal dataset of 412 human authors and 6,086 documents (2012-2024) across academic abstracts, blogs, and news. It generates matching trajectories from three LLMs under independent and history-conditioned settings, then applies drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations. The central finding is temporal flattening in LLM output: greater lexical diversity but substantially reduced semantic and cognitive-emotional drift relative to humans. These patterns alone yield 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories, persisting across generation conditions.

Significance. If the separation is robust to representation choices and domain controls, the work supplies quantitative evidence that current stateless or short-context LLM deployment paradigms cannot reproduce the longitudinal evolution characteristic of human writing. This has direct consequences for synthetic training data, author-style modeling, and longitudinal NLP tasks. The public dataset release is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.
  2. [Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.
minor comments (2)
  1. [Abstract] Abstract: The claim of persistence 'regardless of whether LLMs generate independently or with access to incremental history' should be accompanied by a brief quantitative statement of the accuracy drop (or lack thereof) under the history-conditioned condition.
  2. [Dataset Construction] Dataset description: Provide a table or paragraph confirming the number of documents per domain and per author to allow readers to assess balance.
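
Both the domain-balance control raised in major comment 1 and the per-domain tally requested in minor comment 2 come down to bookkeeping over document metadata. The following is a minimal sketch, assuming each document is a dict with hypothetical `author` and `domain` keys (the released dataset's actual schema is not described on this page) and assuming that "balanced" means downsampling every domain to the size of the smallest one.

```python
import random
from collections import defaultdict


def domain_counts(docs):
    """Count documents per domain."""
    counts = defaultdict(int)
    for doc in docs:
        counts[doc["domain"]] += 1
    return dict(counts)


def balanced_subsample(docs, seed=0):
    """Downsample each domain to the size of the smallest domain."""
    if not docs:
        return []
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)
    k = min(len(group) for group in by_domain.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sample = []
    for domain in sorted(by_domain):
        sample.extend(rng.sample(by_domain[domain], k))
    return sample


# Hypothetical toy corpus: 5 academic, 3 blog, 2 news documents.
docs = (
    [{"author": f"a{i}", "domain": "academic"} for i in range(5)]
    + [{"author": f"b{i}", "domain": "blog"} for i in range(3)]
    + [{"author": f"n{i}", "domain": "news"} for i in range(2)]
)

balanced = balanced_subsample(docs)
print("before:", domain_counts(docs))
print("after:", domain_counts(balanced))
```

Re-running the drift comparison on `balanced` and on the full corpus would show whether the human/LLM gap depends on one over-represented domain.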

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, as well as for recognizing the potential significance of the temporal flattening findings and the value of the released dataset. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

  1. Referee: [Methods] Methods section: The drift and variance metrics are applied to three representation families, yet no ablation is reported on alternative embeddings or on explicit domain-balance controls across the 412-author corpus. If the reduced semantic/cognitive drift is partly an artifact of how the chosen embeddings encode LLM outputs (rather than genuine flattening), the 94% classification accuracy would not establish the claimed temporal property. This is load-bearing for the headline result.

    Authors: We agree that additional robustness checks are necessary to rule out embedding-specific artifacts. The three representation families (semantic sentence embeddings, lexical TF-IDF, and cognitive-emotional lexicons) were chosen to align with established longitudinal text analysis practices, and the flattening pattern holds across all three. Nevertheless, to directly address this concern we will add (i) ablations using alternative embeddings (e.g., different sentence-transformer variants and static word embeddings) and (ii) explicit domain-balanced subsampling experiments that enforce equal representation of academic, blog, and news documents per author. These results will be reported in a new subsection of the Methods and Results sections.
    revision: yes

  2. Referee: [Results] Results section (classification experiments): The abstract states that temporal variability patterns achieve 94% accuracy and 98% ROC-AUC, but it is unclear whether the drift/variance thresholds were pre-specified before seeing the data or tuned afterward. If the latter, the separation may be inflated; explicit reporting of pre-registration or nested cross-validation is required to confirm the metric is not post-hoc.

    Authors: The drift and variance thresholds were computed from the empirical distribution of human trajectories on a held-out training partition and then applied to a disjoint test partition; no hyperparameter search was performed on the test data. To eliminate any remaining ambiguity about post-hoc selection, we will replace the single-threshold classifier with a nested cross-validation procedure (outer loop for evaluation, inner loop for threshold selection) and report the resulting accuracy and ROC-AUC with confidence intervals. We will also add a statement clarifying the train/test split protocol in the revised Results section.
    revision: yes
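
The nested cross-validation promised in the second response can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the inner loop here chooses which variability feature to threshold and the cut-off is fitted on training rows only, the helper names (`fit_threshold`, `folds`, `nested_cv`) are hypothetical, and the data are invented toy values in which column 0 is informative and column 1 is noise.

```python
def accuracy(scores, labels, threshold):
    """Share of items where 'score >= threshold' agrees with label 1 (human)."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(scores)


def fit_threshold(scores, labels):
    """Best-accuracy cut-off among midpoints of consecutive distinct scores."""
    values = sorted(set(scores))
    if len(values) < 2:
        return values[0]
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


def folds(n, k):
    """Yield (train, test) index lists; fold i tests indices i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        yield [j for j in range(n) if j not in held_out], test


def column(features, rows, col):
    """Values of one feature column for the given row indices."""
    return [features[i][col] for i in rows]


def nested_cv(features, labels, outer_k=3, inner_k=2):
    """Outer loop scores the model; inner loop picks the feature column."""
    outer_scores = []
    for train, test in folds(len(labels), outer_k):

        def inner_score(col):
            # Mean validation accuracy using only outer-training rows.
            accs = []
            for fit_pos, val_pos in folds(len(train), inner_k):
                fit_rows = [train[p] for p in fit_pos]
                val_rows = [train[p] for p in val_pos]
                t = fit_threshold(column(features, fit_rows, col),
                                  [labels[i] for i in fit_rows])
                accs.append(accuracy(column(features, val_rows, col),
                                     [labels[i] for i in val_rows], t))
            return sum(accs) / len(accs)

        best_col = max(range(len(features[0])), key=inner_score)
        t = fit_threshold(column(features, train, best_col),
                          [labels[i] for i in train])
        outer_scores.append(accuracy(column(features, test, best_col),
                                     [labels[i] for i in test], t))
    return outer_scores


# Invented toy data: label 1 = human, 0 = LLM; column 0 = drift, column 1 = noise.
human_drift = [0.70, 0.75, 0.80, 0.85, 0.90, 0.72]
llm_drift = [0.10, 0.15, 0.20, 0.25, 0.30, 0.12]
features, labels = [], []
for i, (h, l) in enumerate(zip(human_drift, llm_drift)):
    features.append([h, (i * 7 % 5) / 10])
    labels.append(1)
    features.append([l, (i * 3 % 5) / 10])
    labels.append(0)

scores = nested_cv(features, labels)
print("outer-fold accuracies:", scores)
```

The point of the design is that the outer test rows never influence either the feature choice or the cut-off, which is the property the referee asks the authors to demonstrate.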

Circularity Check

0 steps flagged

No circularity: empirical metric application on independent trajectories

The paper constructs an external longitudinal dataset of human authors and generates LLM trajectories under explicit settings, then applies standard drift and variance metrics over semantic/lexical/cognitive representations to measure differences. The reported 94% accuracy and 98% ROC-AUC are direct empirical outcomes of a downstream classifier on these computed features, with no equations or steps that reduce the distinction to a fitted parameter by construction, self-definition of the target quantity, or load-bearing self-citation. The derivation chain remains self-contained against the released data and generation protocols.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard assumptions about what constitutes temporal evolution in text and on the validity of off-the-shelf embedding models for measuring semantic and emotional drift. No new entities are postulated, and the free parameters appear limited to metric implementation choices that the abstract does not specify.

free parameters (1)
  • drift and variance thresholds
    Parameters used to quantify drift and variance in the chosen representations; exact values or fitting procedures not specified in abstract.
axioms (1)
  • domain assumption: Human writing trajectories exhibit measurable longitudinal evolution in semantic, lexical, and cognitive-emotional dimensions.
    Foundational premise for constructing the comparison and interpreting reduced drift as flattening.

pith-pipeline@v0.9.0 · 5540 in / 1288 out tokens · 32749 ms · 2026-05-10T15:02:48.382859+00:00 · methodology


    Accumulate— Append the year- t summary to the history block for use in yeart+1. Steps 1–3 repeat for each subsequent year, building an incrementally growing history per author tra- jectory. This design mirrors a realistic scenario in which an LLM has access to a compressed record of its own prior outputs, testing whether such con- text enables more human-...