
arxiv: 2511.01619 · v2 · submitted 2025-11-03 · 💻 cs.CL

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Pith reviewed 2026-05-18 00:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords parliamentary corpora · spoken language · Slavic languages · automatic annotation · sentiment analysis · filled pauses · forced alignment · acoustic analysis

The pith

Automatic annotations turn parliamentary speech recordings into resources for studying sentiment and disfluencies in four languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases version 3.0 of ParlaSpeech, spoken parliamentary corpora covering Croatian, Czech, Polish and Serbian for a total of roughly 6000 hours. The update adds automatic layers of linguistic processing and sentiment predictions to the text side, plus filled-pause detection on the audio side, with word-level alignments and primary stress marking added for two of the languages. The authors show the added value through an analysis that links acoustic properties of the speech to the predicted sentiment labels. By making the data available in standard formats and through a concordancer, the work aims to support studies in linguistics, acoustics and related fields that would otherwise require building these layers from scratch.

Core claim

The central claim is that automatic pipelines applied to existing ParlaMint transcripts and aligned parliamentary recordings produce enriched corpora: the textual modality now carries linguistic annotations and sentiment predictions, and the spoken modality carries filled-pause markers. Two languages further receive detailed word- and grapheme-level alignments together with automatic primary-stress positions. The paper presents these additions as having drastically increased the corpora's usefulness for downstream research and illustrates this with an acoustic analysis of sentiment.

What carries the argument

Automatic annotation pipelines that add sentiment prediction, filled-pause detection, forced alignment and stress marking to the base ParlaMint-derived parliamentary recordings and transcripts.

If this is right

  • The data now directly supports studies that link acoustic features such as pitch, duration or intensity to sentiment in parliamentary speech.
  • Filled-pause annotations enable analysis of disfluency patterns in formal political speech across the four languages.
  • Linguistic tags allow syntactic and semantic investigation of spoken rather than written parliamentary language.
  • Standard JSONL and TextGrid formats let researchers load the corpora into concordancers or speech-processing pipelines without further preprocessing.
  • Cross-linguistic comparisons of how sentiment is expressed acoustically become feasible within the Slavic parliamentary domain.
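To make the JSONL point concrete, here is a minimal sketch of reading such a file line by line and grouping records by sentiment label. The field names `id`, `text` and `sentiment` and the two sample records are illustrative assumptions, not the released schema; consult the dataset documentation for the actual keys.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical in-memory sample standing in for a corpus file.
# The keys "id", "text" and "sentiment" are assumptions for illustration.
sample = io.StringIO(
    '{"id": "u1", "text": "Hvala.", "sentiment": "positive"}\n'
    '{"id": "u2", "text": "Ne.", "sentiment": "negative"}\n'
)

records = list(iter_records(sample))

# Group record ids by their (assumed) sentiment label.
by_label = {}
for rec in records:
    by_label.setdefault(rec["sentiment"], []).append(rec["id"])

print(by_label)
```

For a real file, the same generator can be fed an open file handle instead of the `io.StringIO` sample, which keeps memory use flat for large corpora.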

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resource could serve as seed data for training sentiment-aware speech models in languages with limited labeled audio.
  • Combining the new stress and alignment layers with sentiment scores might help model how prosody conveys attitude in political talk.
  • Similar automatic enrichment steps could be applied to parliamentary recordings in additional languages to create comparable cross-lingual collections.
  • The acoustic-sentiment findings open the possibility of testing whether the same correlates hold in spontaneous rather than parliamentary speech.

Load-bearing premise

The automatic pipelines for sentiment prediction, filled-pause detection, alignment and stress marking produce labels accurate enough to support the claimed downstream research value.

What would settle it

A manual audit on a held-out sample that finds frequent sentiment mislabels or alignment errors large enough to alter the acoustic-sentiment correlation results would undermine the usefulness claim.
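One way such an audit could be scored is sketched below: plain accuracy plus Cohen's kappa between manual and automatic labels. The label lists are invented for illustration, and the choice of kappa is an assumption of this sketch, not a procedure described in the paper.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of items where the manual and automatic labels agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Chance-corrected agreement between two label sequences.

    Raises ZeroDivisionError when expected agreement is 1
    (both sequences use a single identical label).
    """
    n = len(gold)
    p_observed = accuracy(gold, pred)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    labels = set(gold_counts) | set(pred_counts)
    p_expected = sum(gold_counts[l] * pred_counts[l] for l in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical audit sample: manual (gold) vs. automatic (pred) sentiment.
gold = ["pos", "neg", "neg", "pos", "neg", "pos"]
pred = ["pos", "neg", "pos", "pos", "neg", "neg"]

acc = accuracy(gold, pred)
kappa = cohen_kappa(gold, pred)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Reporting kappa alongside accuracy helps when one sentiment class dominates, since raw agreement can look high by chance alone.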

Figures

Figures reproduced from arXiv: 2511.01619 by Ivan Porupski, Nikola Ljubešić, Peter Rupnik, Taja Kuzman Pungeršek.

Figure 1: Relative distribution of the number of speakers by year-of-birth and gender across all four
Figure 2: Visualization of the P(Neg > Pos) effect size on three strong sentiment predictors, pitch (F0), intensity (Int) and speech rate (SR), across the four languages. Speaker average and instance results are shown. Statistically non-significant results (instance-level speech rate in Czech and Polish) are omitted from the plot.

… the speaker-average level, pitch (F0) and intensity emerged as the most robust indic…
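The P(Neg > Pos) quantity in Figure 2 reads like a probability-of-superiority effect size: the chance that a randomly drawn negative-sentiment value exceeds a randomly drawn positive-sentiment one. Assuming that reading (the paper's exact estimator is not given on this page), a minimal sketch with invented pitch values looks like this:

```python
def prob_superiority(neg, pos):
    """Estimate P(Neg > Pos) over all (neg, pos) pairs; ties count half."""
    wins = 0.0
    for a in neg:
        for b in pos:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(neg) * len(pos))


# Hypothetical mean F0 values in Hz, for illustration only.
neg_f0 = [210.0, 195.0, 220.0]
pos_f0 = [200.0, 190.0, 205.0]

effect = prob_superiority(neg_f0, pos_f0)
print(f"P(Neg > Pos) = {effect:.3f}")
```

A value of 0.5 means no separation between the two groups; values above 0.5 mean the feature tends to be higher in negative-sentiment speech. The pairwise loop is quadratic, so a rank-based computation would be preferable at corpus scale.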
Original abstract

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similar to that, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ParlaSpeech 3.0, an extension of spoken parliamentary corpora covering Croatian, Czech, Polish, and Serbian (totaling approximately 6,000 hours). Built from ParlaMint transcripts and aligned audio, the release adds automatic layers: linguistic annotations and sentiment predictions to the textual modality; filled-pause detection to the spoken modality; and, for two languages, word-/grapheme-level alignments plus automatic primary-stress marking. The authors state that these enrichments 'drastically increase' the corpora’s usefulness for downstream research across disciplines and illustrate the point with an analysis of acoustic correlates of sentiment. Data are released in JSONL and TextGrid formats and via a concordancer.

Significance. A large-scale, multilingual spoken parliamentary resource with added sentiment, disfluency, alignment, and prosodic layers would be a valuable contribution to computational linguistics, speech processing, and discourse studies, particularly for Slavic languages. The multi-format release and search interface are practical strengths. However, because the central claim of drastically increased usefulness rests on the reliability of the automatic labels, and no quantitative validation is reported, the practical significance for downstream users remains difficult to assess at present.

major comments (2)
  1. [Abstract / Automatic annotation pipelines] Abstract and the section describing the automatic annotation pipelines: the claim that the enrichments 'drastically increase' usefulness for downstream research is not supported by any reported accuracy, precision/recall, or error analysis for the sentiment classifier or the filled-pause detector. Without these figures, the acoustic-correlates-of-sentiment showcase risks reflecting model artifacts rather than genuine linguistic patterns.
  2. [Showcase analysis] Showcase analysis section: the acoustic-sentiment correlations are presented without any discussion of how label noise in the automatic sentiment predictions or pause annotations might affect the observed relationships, leaving the demonstration of increased usefulness incomplete.
minor comments (2)
  1. [Data description] The abstract states that two of the four languages receive alignments and stress marking, but the main text should explicitly name which two languages receive these layers for clarity.
  2. [Corpus statistics] Consider adding a short table summarizing the size (hours or tokens) and annotation coverage per language to make the scale of the resource immediately visible.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the comments concerning the lack of quantitative validation for the automatic annotations and the incomplete discussion of label noise in the showcase analysis. Below we respond to each major comment point by point, indicating the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Automatic annotation pipelines] Abstract and the section describing the automatic annotation pipelines: the claim that the enrichments 'drastically increase' usefulness for downstream research is not supported by any reported accuracy, precision/recall, or error analysis for the sentiment classifier or the filled-pause detector. Without these figures, the acoustic-correlates-of-sentiment showcase risks reflecting model artifacts rather than genuine linguistic patterns.

    Authors: We agree that the original manuscript did not report explicit performance metrics for the sentiment classifier and filled-pause detector. The annotations rely on established models whose performance is documented in the cited literature, but we acknowledge that direct figures and error analysis would better support the usefulness claim. In the revised manuscript we have added a new subsection on annotation quality that reports precision, recall, and accuracy figures obtained from our internal evaluations and the source model papers. We have also revised the abstract and introduction to state that the enrichments 'substantially increase' usefulness, which more accurately reflects the evidence provided. A short error analysis has been included to address the risk of model artifacts in the showcase. revision: yes

  2. Referee: [Showcase analysis] Showcase analysis section: the acoustic-sentiment correlations are presented without any discussion of how label noise in the automatic sentiment predictions or pause annotations might affect the observed relationships, leaving the demonstration of increased usefulness incomplete.

    Authors: We concur that an explicit discussion of label noise is required to make the demonstration of increased usefulness complete. The revised manuscript expands the showcase analysis section with a dedicated paragraph that considers how noise in the automatic sentiment predictions and pause annotations could affect the observed acoustic correlations. This includes qualitative assessment of potential biases and practical guidance for users on interpreting results in the presence of such noise. revision: yes

Circularity Check

0 steps flagged

No circularity: resource paper with no derivations or fitted predictions

Full rationale

The manuscript is a data-resource release that describes automatic pipelines for adding linguistic annotations, sentiment labels, filled-pause detection, alignments, and stress marking to existing parliamentary recordings. No equations, model training loops, or quantitative predictions are presented that could reduce to their own inputs by construction. The showcase acoustic-correlates analysis is offered only as an illustration of downstream utility, not as a derived result whose validity depends on a self-referential fit or self-citation chain. All load-bearing steps (corpus construction and annotation) are external to any claimed derivation, making the paper self-contained against the circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on existing automatic annotation tools and prior parliamentary transcripts; no new free parameters, mathematical axioms, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5757 in / 1047 out tokens · 36269 ms · 2026-05-18T00:57:14.400138+00:00 · methodology

