pith. sign in

arxiv: 2604.19261 · v1 · submitted 2026-04-21 · 💻 cs.CL

Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

Pith reviewed 2026-05-10 02:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords narrative evaluationlinguistic featuresquantitative stylisticstext clusteringnarrative qualitystylistic analysissimilarity matrixcomputational linguistics
0
0 comments X

The pith

A set of 33 linguistic features can automatically cluster professionally edited narratives apart from self-published ones and match human quality judgments better than standard metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a quantitative method for assessing narrative quality by extracting 33 linguistic features divided into lexical, syntactic, and semantic categories. It applies this to a corpus of 23 books and uses a similarity matrix to group the texts, achieving near-perfect separation between professionally edited works and self-published ones. The system is then tested on human-annotated data where it outperforms traditional story-level evaluation approaches. This matters for turning subjective judgments about writing quality into measurable, language-based signals that could scale to larger collections of text.

Core claim

The central claim is that quantitative linguistic features alone provide a reliable basis for evaluating narrative quality. On a specialized corpus of 23 books containing both canonical masterpieces and self-published works, a similarity matrix built from the 33 features clustered the narratives so that professionally edited texts were distinguished almost perfectly from self-published ones. When validated against a human-annotated dataset, the linguistic approach significantly outperformed traditional story-level evaluation metrics.

What carries the argument

The extraction of 33 quantitative linguistic features (lexical, syntactic, semantic) followed by construction of a similarity matrix to cluster and compare narratives.

If this is right

  • Narrative quality can be assessed automatically from language patterns without needing plot summaries or character analysis.
  • Professional editing produces detectable linguistic signatures that set texts apart from unedited self-published work.
  • The framework offers a quantitative alternative to existing story-level metrics for comparing narrative styles.
  • Linguistic features appear more predictive of human-perceived quality than conventional evaluation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method holds on larger and more varied collections, writers could use it as an automated style-checker during revision.
  • The observed separation may primarily reflect editing and polishing steps rather than raw creative quality, pointing to uses in publishing screening pipelines.
  • Adding genre-specific feature weights or testing on AI-generated text could extend the approach to new detection tasks.

Load-bearing premise

The 33 linguistic features capture the essential signals of narrative quality and the clear separation seen in this small set of 23 books will generalize to other texts despite differences in genre, length, or content.

What would settle it

Run the same 33-feature clustering on a fresh collection of 50 books balanced across genres and publication types; if the professional versus self-published groups no longer separate at high accuracy, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.19261 by Alessandro Maisto.

Figure 1
Figure 1. Figure 1: Generated Network of selected books [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
read the original abstract

The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 3 minor

Summary. The paper proposes a quantitative stylistic framework for narrative evaluation based on 33 linguistic features grouped into lexical, syntactic, and semantic categories. It reports that a similarity matrix computed over these features on a corpus of 23 books produces near-perfect clustering that separates professionally edited canonical works from self-published texts. The approach is further validated on a human-annotated dataset, where it is claimed to significantly outperform traditional story-level evaluation metrics.

Significance. If the central claims hold after addressing confounds, the work could contribute an objective, feature-driven method for assessing narrative quality with potential uses in literary studies, publishing workflows, and evaluation of generated text. The combination of automated clustering and human validation is a constructive direction, though the small corpus size and absence of controls substantially limit current generalizability and impact.

major comments (4)
  1. [Section 4.1] Section 4.1 (Corpus): The 23-book corpus is not described as balanced or controlled for length, genre, topic, or publication era. Multiple features (type-token ratio, sentence length, vocabulary richness) are known to covary with these variables, so the observed near-perfect separation may be driven by confounds rather than narrative quality.
  2. [Section 3] Section 3 (Feature Extraction): The 33 features are enumerated by category but lack explicit computational definitions, formulas, normalization steps, or preprocessing details (e.g., dialogue handling). This prevents replication and makes it impossible to evaluate whether the feature set was fixed a priori or tuned to the test corpus.
  3. [Section 4.2] Section 4.2 (Similarity and Clustering): No specification is given for the similarity metric, any dimensionality reduction, the clustering algorithm, or quantitative cluster-quality metrics (silhouette score, adjusted Rand index). The claim of 'near-perfect' separation is unsupported by statistical significance tests or permutation baselines.
  4. [Section 5] Section 5 (Human Validation): The human-annotated dataset is referenced without reporting its size, annotation guidelines, inter-annotator agreement, exact performance metrics, or the specific traditional baselines against which outperformance is asserted.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'almost perfectly' is imprecise; a numerical accuracy, confusion matrix, or reference to a specific figure/table should be supplied.
  2. [Section 3] Notation: Feature names and similarity-matrix construction are introduced without consistent mathematical notation or pseudocode, hindering clarity.
  3. [References] References: Standard stylometric and narrative-evaluation literature (e.g., works on type-token ratio, syntactic complexity measures) is under-cited.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate revisions to improve reproducibility, statistical rigor, and transparency.

read point-by-point responses
  1. Referee: [Section 4.1] Section 4.1 (Corpus): The 23-book corpus is not described as balanced or controlled for length, genre, topic, or publication era. Multiple features (type-token ratio, sentence length, vocabulary richness) are known to covary with these variables, so the observed near-perfect separation may be driven by confounds rather than narrative quality.

    Authors: We acknowledge that Section 4.1 does not explicitly report balancing or distributions for length, genre, topic, or era, which could introduce confounds. The corpus was assembled to contrast professional canonical works with self-published texts, but we will add a table in the revision detailing lengths, genres, publication eras, and topics for all 23 books, plus correlation analysis between these variables and the 33 features to quantify potential confounds. revision: partial

  2. Referee: [Section 3] Section 3 (Feature Extraction): The 33 features are enumerated by category but lack explicit computational definitions, formulas, normalization steps, or preprocessing details (e.g., dialogue handling). This prevents replication and makes it impossible to evaluate whether the feature set was fixed a priori or tuned to the test corpus.

    Authors: We agree that explicit definitions are required for reproducibility. The 33 features were selected a priori from linguistic literature and fixed before experiments. In the revision we will add formulas, normalization procedures, and preprocessing details, including dialogue handling (quoted speech is excluded from lexical and syntactic counts and processed separately for semantic features). revision: yes

  3. Referee: [Section 4.2] Section 4.2 (Similarity and Clustering): No specification is given for the similarity metric, any dimensionality reduction, the clustering algorithm, or quantitative cluster-quality metrics (silhouette score, adjusted Rand index). The claim of 'near-perfect' separation is unsupported by statistical significance tests or permutation baselines.

    Authors: We will revise Section 4.2 to specify cosine similarity on the normalized feature vectors, hierarchical clustering with Ward linkage, and no dimensionality reduction. We will report silhouette scores and include a permutation baseline (randomly shuffling labels 1000 times) to test statistical significance of the observed separation. revision: yes

  4. Referee: [Section 5] Section 5 (Human Validation): The human-annotated dataset is referenced without reporting its size, annotation guidelines, inter-annotator agreement, exact performance metrics, or the specific traditional baselines against which outperformance is asserted.

    Authors: We will expand Section 5 to report the dataset size, full annotation guidelines, inter-annotator agreement (Cohen's kappa), exact metrics (precision, recall, F1), and the specific traditional baselines used (Flesch-Kincaid readability, basic sentiment polarity, and plot coherence heuristics). revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical clustering and validation are independent of inputs

full rationale

The paper extracts a fixed set of 33 linguistic features (lexical, syntactic, semantic) from 23 books, computes a similarity matrix to produce clusters separating professionally edited from self-published texts, and validates the approach on a separate human-annotated dataset where it outperforms traditional metrics. No equations, self-definitional steps, or fitted parameters are described that reduce the claimed result to the input data by construction. The feature set is presented as comprehensive and pre-specified rather than derived from the clustering outcome; the human validation provides an external benchmark. No load-bearing self-citations or uniqueness theorems are invoked in the abstract or described methodology that would create a circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on unspecified linguistic feature extraction and similarity computation whose details are absent.

pith-pipeline@v0.9.0 · 5430 in / 1276 out tokens · 56108 ms · 2026-05-10T02:06:05.254494+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    McCabe, C

    A. McCabe, C. Peterson, What makes a good story, Journal of Psycholinguistic Research 13 (1984) 457–480

  2. [2]

    Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

    R. Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

  3. [3]

    B.-C. Bae, S. Jang, Y. Kim, S. Park, A preliminary survey on story interestingness: Focusing on cognitive and emotional interest, in: International conference on interactive digital storytelling, Springer, 2021, pp. 447–453

  4. [4]

    Chhun, F

    C. Chhun, F. M. Suchanek, C. Clavel, Do language models enjoy their own stories? Prompting large language models for automatic story evaluation, Transactions of the Association for Computational Linguistics 12 (2024) 1122–1142. doi:10.1162/tacl_a_00689

  5. [5]

    Chhun, P

    C. Chhun, P. Colombo, F. M. Suchanek, C. Clavel, Of human criteria and automatic metrics: A benchmark of the evaluation of story generation, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 5794–5836

  6. [6]

    J. Xu, X. Ren, Y. Zhang, Q. Zeng, X. Cai, X. Sun, A skeleton-based model for promoting coherence among sentences in narrative story generation, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 201...

  7. [7]

    Jalalzai, P

    H. Jalalzai, P. Colombo, C. Clavel, E. Gaussier, G. Varni, E. Vignon, A. Sabourin, Heavy-tailed representations, text polarity classification & data augmentation, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 4295–4307

  8. [8]

    Brahman, S

    F. Brahman, S. Chaturvedi, Modeling protagonist emotions for emotion-aware storytelling, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5277–5294

  9. [9]

    Rashkin, A

    H. Rashkin, A. Celikyilmaz, Y. Choi, J. Gao, PlotMachines: Outline-conditioned generation with dynamic plot state tracking, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4274–4295

  10. [10]

    Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

    A. Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

  11. [11]

    Johansson, Open weight large language models as a design material in rpgs, 2025

    L. Johansson, Open weight large language models as a design material in rpgs, 2025

  12. [12]

    Jhamtani, T

    H. Jhamtani, T. Berg-Kirkpatrick, Narrative text generation with a latent discrete plan, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 3637–3650

  13. [13]

    Papineni, S

    K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, USA, 2002, p. 311–318

  14. [14]

    Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp

    C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81

  15. [15]

    Grusky, M

    M. Grusky, M. Naaman, Y. Artzi, Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguis...

  16. [16]

    A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, D. Radev, Summeval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2021) 391–409

  17. [17]

    BERTScore: Evaluating Text Generation with BERT

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019)

  18. [18]

    W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, Advances in neural information processing systems 34 (2021) 27263–27277

  19. [19]

    P. J. A. Colombo, C. Clavel, P. Piantanida, Infolm: A new metric to evaluate summarization & data2text generation, in: Proceedings of the AAAI conference on artificial intelligence, volume 36, 2022, pp. 10554–10562

  20. [20]

    Naismith, P

    B. Naismith, P. Mulcaire, J. Burstein, Automated evaluation of written discourse coherence using gpt-4, in: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023, pp. 394–403

  21. [21]

    Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot perfor- mance of language models, in: International conference on machine learning, PMLR, 2021, pp. 12697–12706

  22. [22]

    van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

    K. van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

  23. [23]

    S. A. Crossley, K. Kyle, D. S. McNamara, The tool for the automatic analysis of text cohesion (taaco): Automatic assessment of local, global, and text cohesion, Behavior research methods 48 (2016) 1227–1237

  24. [24]

    S. A. Crossley, K. Kyle, M. Dascalu, The tool for the automatic analysis of cohesion 2.0: Integrating semantic similarity and text overlap, Behavior research methods 51 (2019) 14–27

  25. [25]

    F. Lima, A. Haendchen Filho, H. Prado, E. Ferneda, Automatic evaluation of textual cohesion in essays, in: 19th International Conference on Computational Linguistics and Intelligent Text Processing, 2018

  26. [26]

    Pellegrino, J

    F. Pellegrino, J. Frey, L. Zanasi, Towards an automatic evaluation of (in) coherence in student essays, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 757–765

  27. [27]

    C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60

  28. [28]

    Devlin, M.-W

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

  29. [29]

    D. D. Malvern, B. J. Richards, A new measure of lexical diversity, British Studies in Applied Linguistics 12 (1997) 58–71

  30. [30]

    West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

    M. West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

  31. [31]

    Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

    A. Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

  32. [32]

    Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

    M. Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

  33. [33]

    Melillo, A

    L. Melillo, A. Maisto, et al., Valutazione automatica della leggibilità dei testi, in: Ieri E Oggi: La Terminologia E Le Sfide Delle Digital Humanities, EDUCatt, 2024, pp. 113–128

  34. [34]

    M. A. K. Halliday, R. Hasan, Cohesion in english, Routledge, 2014

  35. [35]

    C. A. Cameron, K. Lee, S. Webster, K. Munro, A. K. Hunt, M. J. Linton, Text cohesion in children’s narrative writing, Applied Psycholinguistics 16 (1995) 257–269

  36. [36]

    Rohde, L

    D. Rohde, L. Gonnerman, D. Plaut, An improved method for deriving word meaning from lexical co-occurrence, Communication of the ACM 8 (2006)

  37. [37]

    Maisto, G

    A. Maisto, G. Martorelli, A. Paone, S. Pelosi, Extracting video games rating labels from transcript files, Internet of Things 16 (2021) 100439

  38. [38]

    V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (2008) P10008

  39. [39]

    Bastian, S

    M. Bastian, S. Heymann, M. Jacomy, Gephi: an open source software for exploring and manipulating networks., Icwsm 8 (2009) 361–362