A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry
Pith reviewed 2026-05-10 01:45 UTC · model grok-4.3
The pith
Sardinian extemporaneous poetry shows recurring formulaic patterns that support Parry and Lord's theory of oral composition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This is shown through the multidimensional analysis of the A Bolu corpus, which maps poetic text characteristics and provides evidence for formulaic structures in live improvisation.
What carries the argument
The A Bolu corpus of Sardinian cantada logudorese poetry, analyzed via a multidimensional combination of descriptive statistical indices and computational linguistics techniques to detect formulaic recurring patterns.
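To make "descriptive statistical indices" concrete, here is a minimal sketch of two standard lexical indices, the type-token ratio and the hapax ratio. The abstract does not name the indices the authors computed, so the choice of these two, the function name `lexical_indices`, and the placeholder tokens are all assumptions made for illustration.

```python
from collections import Counter


def lexical_indices(tokens):
    """Return basic descriptive indices for a list of tokens.

    Illustrative only: the paper's actual indices are not listed in the abstract.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: types that occur exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        "hapax_ratio": hapaxes / n_types if n_types else 0.0,
    }


# Placeholder tokens, not taken from the A Bolu corpus.
sample = "a b a c a b d".split()
indices = lexical_indices(sample)
```

Both ratios depend on text length, so comparisons between poets or corpora would need samples of comparable size.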
If this is right
- The corpus supplies concrete data for testing formulaic composition in living oral traditions beyond ancient examples.
- Similar structured datasets could be built for other improvised poetic forms to compare formulaicity across languages.
- The statistical and computational methods offer a template for quantifying oral creativity in minority language contexts.
- NLP tools for Sardinian and related languages can incorporate the identified patterns to handle extemporaneous text better.
Where Pith is reading between the lines
- The patterns might help train generative models for poetry in low-resource languages by providing examples of formulaic improvisation.
- Cross-checking the corpus against live recordings of performances could test whether the detected patterns hold in real-time delivery.
- The approach could extend to other performative genres like rap battles or storytelling to see if formulaicity appears universally.
Load-bearing premise
The chosen multidimensional analysis reliably detects formulaicity without being influenced by annotation choices or data selection in the corpus construction.
What would settle it
A re-run of the analysis on an independently collected and differently annotated set of Sardinian extemporaneous poetry that fails to show the same recurring patterns would undermine the claim of formulaicity.
Original abstract
The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation, metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces A Bolu, the first structured corpus of Sardinian extemporaneous poetry (cantada logudorese) with 2,835 stanzas and 141,321 tokens. It details the corpus architecture and applies a multidimensional analysis using descriptive statistical indices together with computational linguistics techniques, reporting recurring patterns that the authors interpret as supporting Parry and Lord's theory of formulaicity in oral poetry.
Significance. The creation of this dedicated dataset for a minority-language oral tradition is a clear strength and provides a new resource for computational philology and NLP in low-resource settings. If the analysis can be strengthened, the work could advance understanding of improvisational poetic structures and support more inclusive language technologies.
major comments (1)
- [Results] Results section: The central claim that recurring patterns support Parry-Lord formulaicity is load-bearing for the paper's interpretive contribution, yet the analysis lacks a control corpus of non-improvisational Sardinian poetry (written or composed) subjected to the identical set of descriptive indices and computational metrics. Without this comparison, observed n-gram repetitions or structural regularities cannot be distinguished from generic effects of Sardinian meter, rhyme, or poetic conventions.
minor comments (2)
- [Abstract] Abstract: The description of the 'multidimensional analysis' would benefit from naming the specific statistical indices and computational techniques applied, as this directly affects assessment of the reported patterns.
- [Corpus Description] Corpus construction: Additional details on tokenization, stanza segmentation criteria, and any validation of the 2,835-stanza collection would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address the single major comment below and have revised our interpretation accordingly.
- Referee: Results section: The central claim that recurring patterns support Parry-Lord formulaicity is load-bearing for the paper's interpretive contribution, yet the analysis lacks a control corpus of non-improvisational Sardinian poetry (written or composed) subjected to the identical set of descriptive indices and computational metrics. Without this comparison, observed n-gram repetitions or structural regularities cannot be distinguished from generic effects of Sardinian meter, rhyme, or poetic conventions.
  Authors: We agree that this is a substantive limitation. Our current analysis identifies recurring n-gram and structural patterns in the improvisational corpus and interprets them as consistent with Parry and Lord's oral-formulaic theory, but without a matched control corpus of non-improvised Sardinian poetry we cannot rule out that some regularities stem from the language's metrical and rhyming conventions more broadly. In the revised manuscript we will (i) rephrase the central claim to state that the patterns are consistent with the theory rather than direct support for it, (ii) add an explicit discussion of this limitation in the Results and Discussion sections, and (iii) outline the construction of a future control corpus as necessary follow-up work. These changes will be incorporated in the next version.
  Revision: yes
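The comparison the referee asks for could be sketched as follows: compute one recurrence metric over the improvised corpus and over a control corpus with identical code, then compare the two values. The metric here (the share of n-gram occurrences whose n-gram appears at least `min_count` times) is an assumption chosen for illustration, not the paper's method, and the token lists are placeholders rather than Sardinian data.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recurrence_rate(stanzas, n=3, min_count=2):
    """Share of n-gram occurrences belonging to n-grams seen >= min_count times.

    N-grams are counted within stanzas only, never across stanza boundaries.
    """
    counts = Counter()
    for stanza in stanzas:
        counts.update(ngrams(stanza, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total


# Placeholder corpora (hypothetical tokens), each a list of tokenized stanzas.
improvised = [["x", "y", "z", "w"], ["x", "y", "z", "v"]]
control = [["x", "y", "z", "w"], ["p", "q", "r", "s"]]

rate_improvised = recurrence_rate(improvised)
rate_control = recurrence_rate(control)
```

Because repetition rates grow with corpus size, a fair comparison would also need corpora of matched size, or repeated subsampling of the larger one.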
Circularity Check
Corpus creation and descriptive analysis without circular derivations
The paper's derivation chain consists of corpus construction followed by application of descriptive statistical indices and computational linguistics techniques to identify recurring patterns in the extemporaneous poetry. These patterns are presented as evidence supporting Parry and Lord's theory of formulaicity. Since no mathematical models, parameter fittings, or predictions are involved that could be equivalent to the inputs by construction, and no self-citations are used to justify core premises, there is no circularity. The results are direct observations from the 2,835-stanza dataset, making the study self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction and Background. The growing interest of the Natural Language Processing (NLP) community in low-resource languages, minority varieties and dialects reflects a broader shift toward linguistic inclusivity. For decades, computational tools and annotated resources were concentrated on a small set of high-resource languages, leaving the vast majorit...
2022
-
[2]
Digital Preservation and Resource Creation: We provide a high-fidelity digital repository for a vulnerable minority language tradition, preventing the loss of undocumented or fragmented transcriptions and establishing a foundation for future NLP tasks in Sardinian
-
[3]
Multidimensional Data Modeling: Unlike flat-text corpora, A Bolu is structured to include rich metadata (such as thematic assignments, performer identifiers and precise execution timestamps per stanza) modeled in a hierarchical format to facilitate complex relational queries
arXiv:2604.19584v1 [cs.CL] 21 Apr 2026
-
[4]
Computational Stylistics Analysis: We demonstrate the utility of the dataset by proposing it as a benchmark for investigating "stylistic signatures" and lexical complexity, enabling quantitative research into how real-time improvisational pressures affect the linguistic and metrical choices of the cantadores. The aim of this resource is to integrate Sardinian ...
-
[5]
Methodology. This section describes the methodological approach adopted for the construction of the corpus, divided into the phases of data acquisition (2.1), archive structuring (2.2) and curation of the raw material (2.3). The central objective was to transform a heritage of oral tradition, fragmentary and discontinuous by nature, into a structured digital...
2014
-
[6]
Transcription Quality and Metadata Richness: ... produced by the official editorial staff were included in the corpus. This restriction was implemented to ensure consistency, reliability and philological accuracy across the dataset. A primary selection criterion was the availability of essential contextual metadata, including the performance setting, the d...
-
[7]
Generic Consistency: To avoid stylistic bias, the dataset exclusively comprises performances belonging to the same poetic genre, specifically (cantada logudoresa). This ensures that the metrical constraints and the thematic development remain constant across the entire sample
-
[8]
Linguistic Variety: The selection was restricted to a single linguistic variety of the Sardinian language (Logudorese). This choice eliminates lexical variation due to dialectal shifts, allowing the analysis to focus strictly on the individual poets’ lexical complexity and rhyming strategies. Despite the application of these selection criteria, the c...
-
[9]
Deduplication Record: A primary challenge was the presence of duplicate performances. Many poetic debates were found to be published multiple times under slightly different titles or categorized in different sections of the source website. These redundant entries were identified and removed to ensure that the statistical analysis of lexical frequency and ...
-
[10]
Entity Resolution and Normalization: To ensure that each poet’s stylistic signature was correctly attributed, we performed a normalization of personal names. Variations in transcription, such as the inconsistent use of accents (e.g., Màsala vs. Masala), were reconciled to a single canonical form. This step is crucial for the subsequent calculation of indivi...
-
[11]
Structural Integrity and Lacunae Flagging: Each stanza was checked automatically and manually to verify its completeness. Given the oral and often fragmented nature of the transcriptions, we adopted a symbolic tagging system within the metrical form metadata field to maintain the chronological sequence of the debate without compromising the linguistic...
-
[12]
Temporal Standardization: The execution time for each stanza, recorded in the source as a string format (e.g., 1’00”), was parsed and converted into a discrete numerical variable representing total seconds. In cases where the timing was not present in the original source, or when the stanza was incomplete (as indicated by the asterisk system), a null value was assigned to the fi...
-
[13]
Corpus statistics. The resulting dataset, A Bolu, to the best of our knowledge constitutes the first structured digital corpus of Sardinian extemporaneous poetry explicitly designed for computational analysis. By aggregating dispersed transcriptions and imposing a systematic structural organization, the corpus provides a solid foundation for rigorous em...
2001
-
[14]
Discussion. This study aimed to address two interconnected objectives: the construction of a structured digital corpus of Sardinian extemporaneous poetry suitable for computational analysis and the empirical investigation of formulaic behavior within this tradition. On the first front, A Bolu represents, to our knowledge, the first resource of its kind for th...
1987
-
[15]
Conclusion and Future Works. This study has introduced A Bolu, the first structured digital corpus of Sardinian extemporaneous poetry designed for computational analysis, and has presented a preliminary investigation of lexical, temporal, and formulaic dimensions of the cantada logudoresa. The results provide preliminary empirical support for the oral-fo...
-
[16]
Bibliographical References. Manuela Angioni, Franco Tuveri, Maurizio Virdis, Laura Lucia Lai, and Micol Elisa Maltesi. 2018. SardaNet: A linguistic resource for Sardinian language. In Proceedings of the 9th Global WordNet Conference, pages 412–419, Nanyang Technological University (NTU), Singapore. Global Wordnet Association. R Harald Baayen. 2001. Wor...
-
[17]
Language Resource References. Language Resources. Silvio Calderaro and Johanna Monti. 2026. A Bolu: a Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry. PID https://doi.org/10.5281/zenodo.19264263. Appendix A. Dialogic Mirroring: Full Stanzas and English Translations Original Sardinian Text Stanza #62 (Piras) Stanza #63 ...
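The curation steps quoted in entries [9], [10] and [12] above (deduplication, name normalization, conversion of execution times to seconds) can be sketched roughly as below. This is a sketch under stated assumptions, not the authors' pipeline: the function names, the `text` field, the `incomplete` flag, the accent-stripping rule, and the exact-text duplicate key are all hypothetical, since the excerpts do not say how these steps were implemented.

```python
import re
import unicodedata


def canonical_name(name):
    """Map accent variants of a poet's name (e.g. Màsala / Masala) to one key."""
    decomposed = unicodedata.normalize("NFD", name)
    # Drop combining marks (category Mn), i.e. the accents themselves.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.strip().casefold()


# Minutes mark, then seconds, then an optional seconds mark; straight or curly quotes.
_TIME_RE = re.compile(r"""^\s*(\d+)\s*['’′]\s*(\d{1,2})\s*(?:"|”|″|''|’’)?\s*$""")


def parse_duration(raw, incomplete=False):
    """Convert a string such as 1'00" to total seconds.

    Returns None when the timing is missing or unparseable, or when the
    stanza is flagged as incomplete (hypothetical `incomplete` flag).
    """
    if incomplete or not raw:
        return None
    match = _TIME_RE.match(raw)
    if match is None:
        return None
    return int(match.group(1)) * 60 + int(match.group(2))


def dedupe(performances):
    """Keep the first performance for each normalized text.

    Assumes each performance is a dict with a 'text' field; titles are
    ignored because duplicates may carry slightly different titles.
    """
    seen = set()
    unique = []
    for perf in performances:
        key = " ".join(perf["text"].casefold().split())
        if key not in seen:
            seen.add(key)
            unique.append(perf)
    return unique
```

Keying duplicates on exact normalized text is the simplest option; near-duplicate transcriptions with small wording differences would need a fuzzier comparison.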