Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

Alessandro Maisto

arxiv: 2604.19261 · v1 · submitted 2026-04-21 · 💻 cs.CL

Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

Alessandro Maisto This is my paper

Pith reviewed 2026-05-10 02:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords narrative evaluationlinguistic featuresquantitative stylisticstext clusteringnarrative qualitystylistic analysissimilarity matrixcomputational linguistics

0 comments

The pith

A set of 33 linguistic features can automatically cluster professionally edited narratives apart from self-published ones and match human quality judgments better than standard metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a quantitative method for assessing narrative quality by extracting 33 linguistic features divided into lexical, syntactic, and semantic categories. It applies this to a corpus of 23 books and uses a similarity matrix to group the texts, achieving near-perfect separation between professionally edited works and self-published ones. The system is then tested on human-annotated data where it outperforms traditional story-level evaluation approaches. This matters for turning subjective judgments about writing quality into measurable, language-based signals that could scale to larger collections of text.

Core claim

The central claim is that quantitative linguistic features alone provide a reliable basis for evaluating narrative quality. On a specialized corpus of 23 books containing both canonical masterpieces and self-published works, a similarity matrix built from the 33 features clustered the narratives so that professionally edited texts were distinguished almost perfectly from self-published ones. When validated against a human-annotated dataset, the linguistic approach significantly outperformed traditional story-level evaluation metrics.

What carries the argument

The extraction of 33 quantitative linguistic features (lexical, syntactic, semantic) followed by construction of a similarity matrix to cluster and compare narratives.

If this is right

Narrative quality can be assessed automatically from language patterns without needing plot summaries or character analysis.
Professional editing produces detectable linguistic signatures that set texts apart from unedited self-published work.
The framework offers a quantitative alternative to existing story-level metrics for comparing narrative styles.
Linguistic features appear more predictive of human-perceived quality than conventional evaluation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method holds on larger and more varied collections, writers could use it as an automated style-checker during revision.
The observed separation may primarily reflect editing and polishing steps rather than raw creative quality, pointing to uses in publishing screening pipelines.
Adding genre-specific feature weights or testing on AI-generated text could extend the approach to new detection tasks.

Load-bearing premise

The 33 linguistic features capture the essential signals of narrative quality and the clear separation seen in this small set of 23 books will generalize to other texts despite differences in genre, length, or content.

What would settle it

Run the same 33-feature clustering on a fresh collection of 50 books balanced across genres and publication types; if the professional versus self-published groups no longer separate at high accuracy, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.19261 by Alessandro Maisto.

read the original abstract

The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper clusters 23 books near-perfectly with 33 linguistic features but the separation likely tracks length, genre, or content instead of narrative quality.

read the letter

The key takeaway is that this work clusters a small set of 23 books almost perfectly into professional and self-published using 33 linguistic features, and reports better results than traditional metrics on human judgments. But the setup leaves room for the clusters to be explained by differences in length, genre, or content rather than quality. What is new is the specific grouping of those features into lexical, syntactic, and semantic buckets and applying them to narrative assessment via similarity matrix. The paper does a decent job laying out a quantitative framework and running an experiment on real texts, which is more than just theory. The main issues are the tiny corpus size and lack of controls. Nothing in the description indicates they matched the books on length or genre, so features like type-token ratio or sentence complexity could dominate the similarity scores. The human-annotated validation is stated as outperforming others, but without details on the dataset size, how annotations were done, or statistical tests, it's difficult to assess. Feature definitions and the exact clustering method also need more transparency. This paper would interest people working on computational stylistics or digital humanities tools for text analysis. A reader could pick up the feature list and try it on their own data. It has enough of a concrete method and initial results to warrant peer review, though it will need substantial work on the experimental design to be convincing. I'd send it to referees.

Referee Report

4 major / 3 minor

Summary. The paper proposes a quantitative stylistic framework for narrative evaluation based on 33 linguistic features grouped into lexical, syntactic, and semantic categories. It reports that a similarity matrix computed over these features on a corpus of 23 books produces near-perfect clustering that separates professionally edited canonical works from self-published texts. The approach is further validated on a human-annotated dataset, where it is claimed to significantly outperform traditional story-level evaluation metrics.

Significance. If the central claims hold after addressing confounds, the work could contribute an objective, feature-driven method for assessing narrative quality with potential uses in literary studies, publishing workflows, and evaluation of generated text. The combination of automated clustering and human validation is a constructive direction, though the small corpus size and absence of controls substantially limit current generalizability and impact.

major comments (4)

[Section 4.1] Section 4.1 (Corpus): The 23-book corpus is not described as balanced or controlled for length, genre, topic, or publication era. Multiple features (type-token ratio, sentence length, vocabulary richness) are known to covary with these variables, so the observed near-perfect separation may be driven by confounds rather than narrative quality.
[Section 3] Section 3 (Feature Extraction): The 33 features are enumerated by category but lack explicit computational definitions, formulas, normalization steps, or preprocessing details (e.g., dialogue handling). This prevents replication and makes it impossible to evaluate whether the feature set was fixed a priori or tuned to the test corpus.
[Section 4.2] Section 4.2 (Similarity and Clustering): No specification is given for the similarity metric, any dimensionality reduction, the clustering algorithm, or quantitative cluster-quality metrics (silhouette score, adjusted Rand index). The claim of 'near-perfect' separation is unsupported by statistical significance tests or permutation baselines.
[Section 5] Section 5 (Human Validation): The human-annotated dataset is referenced without reporting its size, annotation guidelines, inter-annotator agreement, exact performance metrics, or the specific traditional baselines against which outperformance is asserted.

minor comments (3)

[Abstract] Abstract: The phrase 'almost perfectly' is imprecise; a numerical accuracy, confusion matrix, or reference to a specific figure/table should be supplied.
[Section 3] Notation: Feature names and similarity-matrix construction are introduced without consistent mathematical notation or pseudocode, hindering clarity.
[References] References: Standard stylometric and narrative-evaluation literature (e.g., works on type-token ratio, syntactic complexity measures) is under-cited.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate revisions to improve reproducibility, statistical rigor, and transparency.

read point-by-point responses

Referee: [Section 4.1] Section 4.1 (Corpus): The 23-book corpus is not described as balanced or controlled for length, genre, topic, or publication era. Multiple features (type-token ratio, sentence length, vocabulary richness) are known to covary with these variables, so the observed near-perfect separation may be driven by confounds rather than narrative quality.

Authors: We acknowledge that Section 4.1 does not explicitly report balancing or distributions for length, genre, topic, or era, which could introduce confounds. The corpus was assembled to contrast professional canonical works with self-published texts, but we will add a table in the revision detailing lengths, genres, publication eras, and topics for all 23 books, plus correlation analysis between these variables and the 33 features to quantify potential confounds. revision: partial
Referee: [Section 3] Section 3 (Feature Extraction): The 33 features are enumerated by category but lack explicit computational definitions, formulas, normalization steps, or preprocessing details (e.g., dialogue handling). This prevents replication and makes it impossible to evaluate whether the feature set was fixed a priori or tuned to the test corpus.

Authors: We agree that explicit definitions are required for reproducibility. The 33 features were selected a priori from linguistic literature and fixed before experiments. In the revision we will add formulas, normalization procedures, and preprocessing details, including dialogue handling (quoted speech is excluded from lexical and syntactic counts and processed separately for semantic features). revision: yes
Referee: [Section 4.2] Section 4.2 (Similarity and Clustering): No specification is given for the similarity metric, any dimensionality reduction, the clustering algorithm, or quantitative cluster-quality metrics (silhouette score, adjusted Rand index). The claim of 'near-perfect' separation is unsupported by statistical significance tests or permutation baselines.

Authors: We will revise Section 4.2 to specify cosine similarity on the normalized feature vectors, hierarchical clustering with Ward linkage, and no dimensionality reduction. We will report silhouette scores and include a permutation baseline (randomly shuffling labels 1000 times) to test statistical significance of the observed separation. revision: yes
Referee: [Section 5] Section 5 (Human Validation): The human-annotated dataset is referenced without reporting its size, annotation guidelines, inter-annotator agreement, exact performance metrics, or the specific traditional baselines against which outperformance is asserted.

Authors: We will expand Section 5 to report the dataset size, full annotation guidelines, inter-annotator agreement (Cohen's kappa), exact metrics (precision, recall, F1), and the specific traditional baselines used (Flesch-Kincaid readability, basic sentiment polarity, and plot coherence heuristics). revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical clustering and validation are independent of inputs

full rationale

The paper extracts a fixed set of 33 linguistic features (lexical, syntactic, semantic) from 23 books, computes a similarity matrix to produce clusters separating professionally edited from self-published texts, and validates the approach on a separate human-annotated dataset where it outperforms traditional metrics. No equations, self-definitional steps, or fitted parameters are described that reduce the claimed result to the input data by construction. The feature set is presented as comprehensive and pre-specified rather than derived from the clustering outcome; the human validation provides an external benchmark. No load-bearing self-citations or uniqueness theorems are invoked in the abstract or described methodology that would create a circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on unspecified linguistic feature extraction and similarity computation whose details are absent.

pith-pipeline@v0.9.0 · 5430 in / 1276 out tokens · 56108 ms · 2026-05-10T02:06:05.254494+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

[1]

McCabe, C

A. McCabe, C. Peterson, What makes a good story, Journal of Psycholinguistic Research 13 (1984) 457–480

work page 1984
[2]

Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

R. Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

work page 2003
[3]

B.-C. Bae, S. Jang, Y. Kim, S. Park, A preliminary survey on story interestingness: Focusing on cognitive and emotional interest, in: International conference on interactive digital storytelling, Springer, 2021, pp. 447–453

work page 2021
[4]

Chhun, F

C. Chhun, F. M. Suchanek, C. Clavel, Do language models enjoy their own stories? Prompting large language models for automatic story evaluation, Transactions of the Association for Computational Linguistics 12 (2024) 1122–1142. doi:10.1162/tacl_a_00689

work page doi:10.1162/tacl_a_00689 2024
[5]

Chhun, P

C. Chhun, P. Colombo, F. M. Suchanek, C. Clavel, Of human criteria and automatic metrics: A benchmark of the evaluation of story generation, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 5794–5836

work page 2022
[6]

J. Xu, X. Ren, Y. Zhang, Q. Zeng, X. Cai, X. Sun, A skeleton-based model for promoting coherence among sentences in narrative story generation, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 201...

work page 2018
[7]

Jalalzai, P

H. Jalalzai, P. Colombo, C. Clavel, E. Gaussier, G. Varni, E. Vignon, A. Sabourin, Heavy-tailed representations, text polarity classification & data augmentation, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 4295–4307

work page 2020
[8]

Brahman, S

F. Brahman, S. Chaturvedi, Modeling protagonist emotions for emotion-aware storytelling, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5277–5294

work page 2020
[9]

Rashkin, A

H. Rashkin, A. Celikyilmaz, Y. Choi, J. Gao, PlotMachines: Outline-conditioned generation with dynamic plot state tracking, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4274–4295

work page 2020
[10]

Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

A. Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

work page arXiv 2025
[11]

Johansson, Open weight large language models as a design material in rpgs, 2025

L. Johansson, Open weight large language models as a design material in rpgs, 2025

work page 2025
[12]

Jhamtani, T

H. Jhamtani, T. Berg-Kirkpatrick, Narrative text generation with a latent discrete plan, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 3637–3650

work page 2020
[13]

Papineni, S

K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, USA, 2002, p. 311–318

work page 2002
[14]

Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp

C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81

work page 2004
[15]

Grusky, M

M. Grusky, M. Naaman, Y. Artzi, Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguis...

work page 2018
[16]

A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, D. Radev, Summeval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2021) 391–409

work page 2021
[17]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019)

work page internal anchor Pith review arXiv 1904
[18]

W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, Advances in neural information processing systems 34 (2021) 27263–27277

work page 2021
[19]

P. J. A. Colombo, C. Clavel, P. Piantanida, Infolm: A new metric to evaluate summarization & data2text generation, in: Proceedings of the AAAI conference on artificial intelligence, volume 36, 2022, pp. 10554–10562

work page 2022
[20]

Naismith, P

B. Naismith, P. Mulcaire, J. Burstein, Automated evaluation of written discourse coherence using gpt-4, in: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023, pp. 394–403

work page 2023
[21]

Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot perfor- mance of language models, in: International conference on machine learning, PMLR, 2021, pp. 12697–12706

work page 2021
[22]

van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

K. van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

work page 2023
[23]

S. A. Crossley, K. Kyle, D. S. McNamara, The tool for the automatic analysis of text cohesion (taaco): Automatic assessment of local, global, and text cohesion, Behavior research methods 48 (2016) 1227–1237

work page 2016
[24]

S. A. Crossley, K. Kyle, M. Dascalu, The tool for the automatic analysis of cohesion 2.0: Integrating semantic similarity and text overlap, Behavior research methods 51 (2019) 14–27

work page 2019
[25]

F. Lima, A. Haendchen Filho, H. Prado, E. Ferneda, Automatic evaluation of textual cohesion in essays, in: 19th International Conference on Computational Linguistics and Intelligent Text Processing, 2018

work page 2018
[26]

Pellegrino, J

F. Pellegrino, J. Frey, L. Zanasi, Towards an automatic evaluation of (in) coherence in student essays, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 757–765

work page 2024
[27]

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60

work page 2014
[28]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019
[29]

D. D. Malvern, B. J. Richards, A new measure of lexical diversity, British Studies in Applied Linguistics 12 (1997) 58–71

work page 1997
[30]

West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

M. West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

work page 1953
[31]

Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

A. Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

work page 2000
[32]

Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

M. Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

work page 1981
[33]

Melillo, A

L. Melillo, A. Maisto, et al., Valutazione automatica della leggibilità dei testi, in: Ieri E Oggi: La Terminologia E Le Sfide Delle Digital Humanities, EDUCatt, 2024, pp. 113–128

work page 2024
[34]

M. A. K. Halliday, R. Hasan, Cohesion in english, Routledge, 2014

work page 2014
[35]

C. A. Cameron, K. Lee, S. Webster, K. Munro, A. K. Hunt, M. J. Linton, Text cohesion in children’s narrative writing, Applied Psycholinguistics 16 (1995) 257–269

work page 1995
[36]

Rohde, L

D. Rohde, L. Gonnerman, D. Plaut, An improved method for deriving word meaning from lexical co-occurrence, Communication of the ACM 8 (2006)

work page 2006
[37]

Maisto, G

A. Maisto, G. Martorelli, A. Paone, S. Pelosi, Extracting video games rating labels from transcript files, Internet of Things 16 (2021) 100439

work page 2021
[38]

V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (2008) P10008

work page 2008
[39]

Bastian, S

M. Bastian, S. Heymann, M. Jacomy, Gephi: an open source software for exploring and manipulating networks., Icwsm 8 (2009) 361–362

work page 2009

[1] [1]

McCabe, C

A. McCabe, C. Peterson, What makes a good story, Journal of Psycholinguistic Research 13 (1984) 457–480

work page 1984

[2] [2]

Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

R. Dickman, The four elements of every successful story, REFLECTIONS-SOCIETY FOR ORGANI- ZATIONAL LEARINING 4 (2003) 51–58

work page 2003

[3] [3]

B.-C. Bae, S. Jang, Y. Kim, S. Park, A preliminary survey on story interestingness: Focusing on cognitive and emotional interest, in: International conference on interactive digital storytelling, Springer, 2021, pp. 447–453

work page 2021

[4] [4]

Chhun, F

C. Chhun, F. M. Suchanek, C. Clavel, Do language models enjoy their own stories? Prompting large language models for automatic story evaluation, Transactions of the Association for Computational Linguistics 12 (2024) 1122–1142. doi:10.1162/tacl_a_00689

work page doi:10.1162/tacl_a_00689 2024

[5] [5]

Chhun, P

C. Chhun, P. Colombo, F. M. Suchanek, C. Clavel, Of human criteria and automatic metrics: A benchmark of the evaluation of story generation, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 5794–5836

work page 2022

[6] [6]

J. Xu, X. Ren, Y. Zhang, Q. Zeng, X. Cai, X. Sun, A skeleton-based model for promoting coherence among sentences in narrative story generation, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 201...

work page 2018

[7] [7]

Jalalzai, P

H. Jalalzai, P. Colombo, C. Clavel, E. Gaussier, G. Varni, E. Vignon, A. Sabourin, Heavy-tailed representations, text polarity classification & data augmentation, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 4295–4307

work page 2020

[8] [8]

Brahman, S

F. Brahman, S. Chaturvedi, Modeling protagonist emotions for emotion-aware storytelling, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5277–5294

work page 2020

[9] [9]

Rashkin, A

H. Rashkin, A. Celikyilmaz, Y. Choi, J. Gao, PlotMachines: Outline-conditioned generation with dynamic plot state tracking, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4274–4295

work page 2020

[10] [10]

Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

A. Maisto, Collaborative storytelling and llm: A linguistic analysis of automatically-generated role-playing game sessions, arXiv preprint arXiv:2503.20623 (2025)

work page arXiv 2025

[11] [11]

Johansson, Open weight large language models as a design material in rpgs, 2025

L. Johansson, Open weight large language models as a design material in rpgs, 2025

work page 2025

[12] [12]

Jhamtani, T

H. Jhamtani, T. Berg-Kirkpatrick, Narrative text generation with a latent discrete plan, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 3637–3650

work page 2020

[13] [13]

Papineni, S

K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, USA, 2002, p. 311–318

work page 2002

[14] [14]

Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp

C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81

work page 2004

[15] [15]

Grusky, M

M. Grusky, M. Naaman, Y. Artzi, Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguis...

work page 2018

[16] [16]

A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, D. Radev, Summeval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2021) 391–409

work page 2021

[17] [17]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019)

work page internal anchor Pith review arXiv 1904

[18] [18]

W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, Advances in neural information processing systems 34 (2021) 27263–27277

work page 2021

[19] [19]

P. J. A. Colombo, C. Clavel, P. Piantanida, Infolm: A new metric to evaluate summarization & data2text generation, in: Proceedings of the AAAI conference on artificial intelligence, volume 36, 2022, pp. 10554–10562

work page 2022

[20] [20]

Naismith, P

B. Naismith, P. Mulcaire, J. Burstein, Automated evaluation of written discourse coherence using gpt-4, in: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023, pp. 394–403

work page 2023

[21] [21]

Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot perfor- mance of language models, in: International conference on machine learning, PMLR, 2021, pp. 12697–12706

work page 2021

[22] [22]

van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

K. van Dalen-Oskam, The riddle of literary quality: a computational approach, Amsterdam Uni- versity Press, 2023

work page 2023

[23] [23]

S. A. Crossley, K. Kyle, D. S. McNamara, The tool for the automatic analysis of text cohesion (taaco): Automatic assessment of local, global, and text cohesion, Behavior research methods 48 (2016) 1227–1237

work page 2016

[24] [24]

S. A. Crossley, K. Kyle, M. Dascalu, The tool for the automatic analysis of cohesion 2.0: Integrating semantic similarity and text overlap, Behavior research methods 51 (2019) 14–27

work page 2019

[25] [25]

F. Lima, A. Haendchen Filho, H. Prado, E. Ferneda, Automatic evaluation of textual cohesion in essays, in: 19th International Conference on Computational Linguistics and Intelligent Text Processing, 2018

work page 2018

[26] [26]

Pellegrino, J

F. Pellegrino, J. Frey, L. Zanasi, Towards an automatic evaluation of (in) coherence in student essays, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 757–765

work page 2024

[27] [27]

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60

work page 2014

[28] [28]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019

[29] [29]

D. D. Malvern, B. J. Richards, A new measure of lexical diversity, British Studies in Applied Linguistics 12 (1997) 58–71

work page 1997

[30] [30]

West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

M. West, A general service list of english words, with semantic frequencies and a supplementary word-list for the writing of popular science and technology, (No Title) (1953)

work page 1953

[31] [31]

Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

A. Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238

work page 2000

[32] [32]

Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

M. Coltheart, The mrc psycholinguistic database, The Quarterly Journal of Experimental Psychol- ogy Section A 33 (1981) 497–505

work page 1981

[33] [33]

Melillo, A

L. Melillo, A. Maisto, et al., Valutazione automatica della leggibilità dei testi, in: Ieri E Oggi: La Terminologia E Le Sfide Delle Digital Humanities, EDUCatt, 2024, pp. 113–128

work page 2024

[34] [34]

M. A. K. Halliday, R. Hasan, Cohesion in english, Routledge, 2014

work page 2014

[35] [35]

C. A. Cameron, K. Lee, S. Webster, K. Munro, A. K. Hunt, M. J. Linton, Text cohesion in children’s narrative writing, Applied Psycholinguistics 16 (1995) 257–269

work page 1995

[36] [36]

Rohde, L

D. Rohde, L. Gonnerman, D. Plaut, An improved method for deriving word meaning from lexical co-occurrence, Communication of the ACM 8 (2006)

work page 2006

[37] [37]

Maisto, G

A. Maisto, G. Martorelli, A. Paone, S. Pelosi, Extracting video games rating labels from transcript files, Internet of Things 16 (2021) 100439

work page 2021

[38] [38]

V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (2008) P10008

work page 2008

[39] [39]

Bastian, S

M. Bastian, S. Heymann, M. Jacomy, Gephi: an open source software for exploring and manipulating networks., Icwsm 8 (2009) 361–362

work page 2009