
arxiv: 2604.19584 · v1 · submitted 2026-04-21 · 💻 cs.CL

A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

Pith reviewed 2026-05-10 01:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords Sardinian poetry · improvisational poetry · formulaicity · oral tradition · corpus linguistics · minority languages · computational analysis · extemporaneous poetry

The pith

Sardinian extemporaneous poetry shows recurring formulaic patterns that support Parry and Lord's theory of oral composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates A Bolu, the first structured corpus of cantada logudorese, consisting of 2,835 stanzas and 141,321 tokens of Sardinian improvised poetry. It combines descriptive statistical indices with computational linguistics techniques to examine the texts. The analysis identifies recurring patterns in the poets' real-time compositions. These patterns align with the formulaic nature of oral poetry described by Parry and Lord. The work also aims to improve NLP resources for under-resourced languages by documenting such traditions.

Core claim

The production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This is shown through the multidimensional analysis of the A Bolu corpus, which maps poetic text characteristics and provides evidence for formulaic structures in live improvisation.

What carries the argument

The A Bolu corpus of Sardinian cantada logudorese poetry, analyzed via a multidimensional combination of descriptive statistical indices and computational linguistics techniques to detect formulaic recurring patterns.

If this is right

  • The corpus supplies concrete data for testing formulaic composition in living oral traditions beyond ancient examples.
  • Similar structured datasets could be built for other improvised poetic forms to compare formulaicity across languages.
  • The statistical and computational methods offer a template for quantifying oral creativity in minority language contexts.
  • NLP tools for Sardinian and related languages can incorporate the identified patterns to handle extemporaneous text better.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The patterns might help train generative models for poetry in low-resource languages by providing examples of formulaic improvisation.
  • Cross-checking the corpus against live recordings of performances could test whether the detected patterns hold in real-time delivery.
  • The approach could extend to other performative genres like rap battles or storytelling to see if formulaicity appears universally.

Load-bearing premise

The chosen multidimensional analysis reliably detects formulaicity without being influenced by annotation choices or data selection in the corpus construction.

What would settle it

A re-run of the analysis on an independently collected and differently annotated set of Sardinian extemporaneous poetry that fails to show the same recurring patterns would undermine the claim of formulaicity.

Original abstract

The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation, metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces A Bolu, the first structured corpus of Sardinian extemporaneous poetry (cantada logudorese) with 2,835 stanzas and 141,321 tokens. It details the corpus architecture and applies a multidimensional analysis using descriptive statistical indices together with computational linguistics techniques, reporting recurring patterns that the authors interpret as supporting Parry and Lord's theory of formulaicity in oral poetry.

Significance. The creation of this dedicated dataset for a minority-language oral tradition is a clear strength and provides a new resource for computational philology and NLP in low-resource settings. If the analysis can be strengthened, the work could advance understanding of improvisational poetic structures and support more inclusive language technologies.

major comments (1)
  1. [Results] Results section: The central claim that recurring patterns support Parry-Lord formulaicity is load-bearing for the paper's interpretive contribution, yet the analysis lacks a control corpus of non-improvisational Sardinian poetry (written or composed) subjected to the identical set of descriptive indices and computational metrics. Without this comparison, observed n-gram repetitions or structural regularities cannot be distinguished from generic effects of Sardinian meter, rhyme, or poetic conventions.
minor comments (2)
  1. [Abstract] Abstract: The description of the 'multidimensional analysis' would benefit from naming the specific statistical indices and computational techniques applied, as this directly affects assessment of the reported patterns.
  2. [Corpus Description] Corpus construction: Additional details on tokenization, stanza segmentation criteria, and any validation of the 2,835-stanza collection would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the single major comment below and have revised our interpretation accordingly.

Point-by-point responses
  1. Referee: Results section: The central claim that recurring patterns support Parry-Lord formulaicity is load-bearing for the paper's interpretive contribution, yet the analysis lacks a control corpus of non-improvisational Sardinian poetry (written or composed) subjected to the identical set of descriptive indices and computational metrics. Without this comparison, observed n-gram repetitions or structural regularities cannot be distinguished from generic effects of Sardinian meter, rhyme, or poetic conventions.

    Authors: We agree that this is a substantive limitation. Our current analysis identifies recurring n-gram and structural patterns in the improvisational corpus and interprets them as consistent with Parry and Lord's oral-formulaic theory, but without a matched control corpus of non-improvised Sardinian poetry we cannot rule out that some regularities stem from the language's metrical and rhyming conventions more broadly. In the revised manuscript we will (i) rephrase the central claim to state that the patterns are consistent with the theory rather than direct support for it, (ii) add an explicit discussion of this limitation in the Results and Discussion sections, and (iii) outline the construction of a future control corpus as necessary follow-up work. These changes will be incorporated in the next version.

    revision: yes
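The control comparison requested by the referee can be illustrated with a minimal sketch. It assumes each corpus is available as a list of tokenized stanzas and uses one simple repetition index: the share of n-gram occurrences whose n-gram also appears in at least one other stanza. The function names, the choice of trigrams, and the toy data are all hypothetical; the paper's actual indices are not named in the abstract, so this is not the authors' method.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repeated_ngram_rate(stanzas, n=3):
    """Share of n-gram occurrences whose n-gram appears in more than one stanza.

    `stanzas` is a list of token lists. Returns 0.0 when no n-grams exist.
    """
    stanza_counts = Counter()   # in how many stanzas each n-gram occurs
    per_stanza = []
    for tokens in stanzas:
        grams = ngrams(tokens, n)
        per_stanza.append(grams)
        stanza_counts.update(set(grams))  # count each stanza at most once
    total = sum(len(grams) for grams in per_stanza)
    if total == 0:
        return 0.0
    repeated = sum(
        1 for grams in per_stanza for g in grams if stanza_counts[g] > 1
    )
    return repeated / total

# Toy placeholder data (not Sardinian text): letters stand in for words.
improvised = [["a", "b", "c", "d"], ["x", "a", "b", "c"], ["e", "f", "g", "h"]]
control = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]

rate_improvised = repeated_ngram_rate(improvised, n=3)
rate_control = repeated_ngram_rate(control, n=3)
difference = rate_improvised - rate_control
```

A raw difference like this is only suggestive. Repetition rates grow with corpus size, so a control corpus would need to be matched in token count, meter, and genre (or subsampled repeatedly) before any gap could be read as evidence of formulaicity rather than of shared poetic conventions.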

Circularity Check

0 steps flagged

Corpus creation and descriptive analysis without circular derivations

Full rationale

The paper's derivation chain consists of corpus construction followed by application of descriptive statistical indices and computational linguistics techniques to identify recurring patterns in the extemporaneous poetry. These patterns are presented as evidence supporting Parry and Lord's theory of formulaicity. Since no mathematical models, parameter fittings, or predictions are involved that could be equivalent to the inputs by construction, and no self-citations are used to justify core premises, there is no circularity. The results are direct observations from the 2,835-stanza dataset, making the study self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is presented as empirical corpus construction and analysis.

pith-pipeline@v0.9.0 · 5506 in / 1086 out tokens · 41873 ms · 2026-05-10T01:45:43.947207+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Introduction and Background The growing interest of the Natural Language Processing (NLP) community in low-resource languages, minority varieties and dialects reflects a broader shift toward linguistic inclusivity. For decades, computational tools and annotated resources were concentrated on a small set of high-resource languages, leaving the vast majorit...

  2. [2]

    Digital Preservation and Resource Creation: We provide a high-fidelity digital repository for a vulnerable minority language tradition, preventing the loss of undocumented or fragmented transcriptions and establishing a foundation for future NLP tasks in Sardinian

  3. [3]

    Multidimensional Data Modeling: Unlike flat-text corpora, A Bolu is structured to include rich metadata, such as thematic assignments, performer identifiers and precise execution timestamps per stanza, modeled in a hierarchical format to facilitate complex relational queries

  4. [4]

    stylistic signatures

    Computational Stylistics Analysis: We demonstrate the utility of the dataset by proposing it as a benchmark for investigating "stylistic signatures" and lexical complexity, enabling quantitative research into how real-time improvisational pressures affect the linguistic and metrical choices of the cantadores. The aim of this resource is to integrate Sardinian ...

  5. [5]

    Methodology This section describes the methodological approach adopted for the construction of the corpus, divided into the phases of data acquisition (2.1), archive structuring (2.2) and curation of the raw material (2.3). The central objective was to transform a heritage of oral tradition, fragmentary and discontinuous by nature, into a structured digital...

  6. [6]

    This restriction was implemented to ensure consistency, reliability and philological accuracy across the dataset

    Transcription Quality and Metadata Richness: produced by the official editorial staff were included in the corpus. This restriction was implemented to ensure consistency, reliability and philological accuracy across the dataset. A primary selection criterion was the availability of essential contextual metadata, including the performance setting, the d...

  7. [7]

    This ensures that the metrical constraints and the thematic development remain constant across the entire sample

    Generic Consistency: To avoid stylistic bias, the dataset exclusively comprises performances belonging to the same poetic genre, specifically cantada logudoresa. This ensures that the metrical constraints and the thematic development remain constant across the entire sample

  8. [8]

    metadata

    Linguistic Variety: The selection was restricted to a single linguistic variety of the Sardinian language (Logudorese). This choice eliminates lexical variation due to dialectal shifts, allowing the analysis to focus strictly on the individual poets’ lexical complexity and rhyming strategies. Despite the application of these selection criteria, the c...

  9. [9]

    Many poetic debates were found to be published multiple times under slightly different titles or categorized in different sections of the source website

    Deduplication Record: A primary challenge was the presence of duplicate performances. Many poetic debates were found to be published multiple times under slightly different titles or categorized in different sections of the source website. These redundant entries were identified and removed to ensure that the statistical analysis of lexical frequency and ...

  10. [10]

    Variations in transcription, such as the inconsistent use of accents (e.g., Màsala vs. Masala), were reconciled to a single canonical form

    Entity Resolution and Normalization: To ensure that each poet’s stylistic signature was correctly attributed, we performed a normalization of personal names. Variations in transcription, such as the inconsistent use of accents (e.g., Màsala vs. Masala), were reconciled to a single canonical form. This step is crucial for the subsequent calculation of indivi...

  11. [11]

    Structural Integrity and Lacunae Flagging: Each stanza was checked automatically and manually to verify its completeness. Given the oral and often fragmented nature of the transcriptions, we adopted a symbolic tagging system within the metrical form metadata field to maintain the chronological sequence of the debate without compromising the linguistic...

  12. [12]

    In cases where the timing was not present in the original source, or when the stanza was incomplete (as indicated by the asterisk system), a null value was assigned to the field

    Temporal Standardization: The execution time for each stanza, recorded in the source as a string format (e.g., 1’00”), was parsed and converted into a discrete numerical variable representing total seconds. In cases where the timing was not present in the original source, or when the stanza was incomplete (as indicated by the asterisk system), a null value was assigned to the fi...

  13. [13]

    Corpus statistics The resulting dataset, A Bolu, to the best of our knowledge constitutes the first structured digital corpus of Sardinian extemporaneous poetry explicitly designed for computational analysis. By aggregating dispersed transcriptions and imposing a systematic structural organization, the corpus provides a solid foundation for rigorous em...

  14. [14]

    On the first front, A Bolu represents, to our knowledge, the first resource of its kind for the cantada logudoresa

    Discussion This study aimed to address two interconnected objectives: the construction of a structured digital corpus of Sardinian extemporaneous poetry suitable for computational analysis and the empirical investigation of formulaic behavior within this tradition. On the first front, A Bolu represents, to our knowledge, the first resource of its kind for th...

  15. [15]

    Conclusion and Future Works This study has introduced A Bolu, the first structured digital corpus of Sardinian extemporaneous poetry designed for computational analysis, and has presented a preliminary investigation of lexical, temporal, and formulaic dimensions of the cantada logudoresa. The results provide preliminary empirical support for the oral-fo...

  16. [16]

    Bibliographical References Manuela Angioni, Franco Tuveri, Maurizio Virdis, Laura Lucia Lai, and Micol Elisa Maltesi. 2018. SardaNet: A linguistic resource for Sardinian language. In Proceedings of the 9th Global WordNet Conference, pages 412–419, Nanyang Technological University (NTU), Singapore. Global Wordnet Association. R Harald Baayen. 2001. Wor...

  17. [17]

    2026. A Bolu: a Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

    Language Resource References Language Resources Silvio Calderaro and Johanna Monti. 2026. A Bolu: a Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry. PID https://doi.org/10.5281/zenodo.19264263. Appendix A. Dialogic Mirroring: Full Stanzas and English Translations Original Sardinian Text Stanza #62 (Piras) Stanza #63 ...
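Two of the curation steps quoted in the extracts above (items 10 and 12: name normalization and temporal standardization) are concrete enough to sketch. The following is an illustration only, not the authors' code. The function names, the choice of an unaccented, case-folded key as the "canonical form", and the set of accepted quote marks are assumptions; the source confirms only the examples Màsala vs. Masala and 1’00”, and that missing or incomplete entries receive a null value.

```python
import re
import unicodedata
from typing import Optional

def canonical_name(name: str) -> str:
    """Fold accents and case so that accent variants of a name share one key.

    Assumption: the canonical form is the unaccented, case-folded spelling.
    """
    decomposed = unicodedata.normalize("NFD", name.strip())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

# Accepted minute and second marks (straight, typographic, prime): an assumption.
MINUTE_MARKS = "'’′"
SECOND_MARKS = '"”″'
TIME_PATTERN = re.compile(
    rf"^\s*(\d+)\s*[{MINUTE_MARKS}]\s*(\d{{1,2}})\s*[{SECOND_MARKS}]?\s*$"
)

def parse_duration(raw: Optional[str], incomplete: bool = False) -> Optional[int]:
    """Convert a string such as 1’00” to total seconds.

    Returns None when the timing is missing, unparseable, or the stanza
    is flagged as incomplete.
    """
    if incomplete or not raw:
        return None
    match = TIME_PATTERN.match(raw)
    if match is None:
        return None
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds
```

For example, `parse_duration("1’00”")` yields 60, while `parse_duration("1’00”", incomplete=True)` yields `None`. Returning `None` rather than 0 keeps missing timings out of any later averages.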