arxiv: 2605.27062 · v1 · pith:BTSJHMU2new · submitted 2026-05-26 · 💻 cs.CL · cs.LG

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

Pith reviewed 2026-06-29 18:05 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords European Portuguese · speech corpus · automatic speech recognition · parliamentary sessions · speaker annotation · pre-training data · word error rate

The pith

A new 5800-hour European Portuguese speech corpus from parliamentary sessions reduces ASR word error rate by up to 14 percent relative when used as pre-training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FalAR, a large dataset of European Portuguese speech drawn from approximately 20 years of parliamentary recordings. It supplies 5800 hours of audio, 4850 of which carry speaker labels and metadata for 1180 people, filling a gap left by the much larger resources available for Brazilian Portuguese. The corpus is assembled with the help of an existing EP ASR model, whose automatic transcriptions are used for transcription-reference alignment. Experiments then show that pre-training on FalAR yields up to a 14 percent relative reduction in word error rate compared with baseline models. This matters for any downstream application that needs accurate speech recognition for the roughly 11 million speakers of European Portuguese.

Core claim

The authors present FalAR as a 5800-hour speaker-annotated corpus of European Portuguese parliamentary speech spanning approximately 20 years, with 4850 hours carrying identity metadata for 1180 speakers. The corpus is constructed using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. When FalAR is added as pre-training data, ASR models achieve up to 14 percent relative WER improvement over baselines that do not use it.
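
To make the headline figure concrete, here is a minimal Python sketch of how word error rate and a relative WER reduction are conventionally computed. It is not the authors' evaluation code; the function names, the example sentence, and the example numbers are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only, not results from the paper.
example_wer = word_error_rate("o senhor deputado tem a palavra",
                              "o senhor deputado tem palavra")
example_gain = relative_wer_reduction(10.0, 8.6)
```

With these illustrative values, a drop from 10.0 to 8.6 absolute WER is a 1.4 point absolute change and a 14 percent relative reduction, which is the kind of quantity the claim refers to.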

What carries the argument

The FalAR corpus itself, assembled through automatic transcription-reference alignment of parliamentary audio, functions as large-scale pre-training material that directly improves downstream ASR accuracy on European Portuguese.

If this is right

  • ASR systems trained with FalAR data perform better on European Portuguese speech than those trained only on existing resources.
  • Speaker metadata in the corpus supports experiments that separate performance by age, gender, or political role.
  • Increasing the amount of aligned data from the corpus improves both alignment quality and final model accuracy.
  • The parliamentary-domain recordings provide a consistent acoustic and linguistic setting for studying long-form speech recognition.
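
The second point above, separating performance by speaker attributes, can be sketched as a pooled WER per metadata group. This is a hedged illustration: the record layout (`gender`, `errors`, `ref_words`) and the speaker identifiers are assumptions, not the schema of the released FalAR metadata.

```python
from collections import defaultdict


def wer_by_group(rows, key):
    """Pooled WER per group: total word errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        errors[row[key]] += row["errors"]
        words[row[key]] += row["ref_words"]
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical per-utterance scoring records, for illustration only.
rows = [
    {"speaker_id": "spk001", "gender": "F", "errors": 3, "ref_words": 50},
    {"speaker_id": "spk002", "gender": "M", "errors": 8, "ref_words": 100},
    {"speaker_id": "spk003", "gender": "F", "errors": 7, "ref_words": 50},
]
per_gender = wer_by_group(rows, "gender")
```

Pooling errors and reference words before dividing weights each group by its amount of speech, so a few very short utterances do not dominate the group score.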

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • FalAR could be combined with Brazilian Portuguese corpora to study cross-variety transfer in ASR.
  • The speaker annotations open the possibility of building models that adapt to individual parliamentary speakers or demographic groups.
  • Similar collection pipelines could be applied to other languages that maintain public parliamentary recordings but lack large speech datasets.
  • The 20-year span may allow studies of language change or diachronic shifts in parliamentary speech.

Load-bearing premise

The automatic transcriptions from the existing CAMÕES ASR model are accurate enough to support reliable alignment and useful model training.
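
One way to probe this premise is to look at how much audio survives as the tolerated alignment error is tightened. The sketch below assumes each segment carries a duration and an alignment error score; the field names (`duration_s`, `align_error`), the thresholds, and the data are hypothetical and do not come from the paper.

```python
def retained_hours(segments, max_error_rate):
    """Hours of audio whose alignment error is at or below the threshold."""
    kept = [s for s in segments if s["align_error"] <= max_error_rate]
    return sum(s["duration_s"] for s in kept) / 3600.0


# Hypothetical segments, for illustration only.
segments = [
    {"id": "seg1", "duration_s": 1800.0, "align_error": 0.02},
    {"id": "seg2", "duration_s": 3600.0, "align_error": 0.10},
    {"id": "seg3", "duration_s": 1800.0, "align_error": 0.35},
]

# Looser thresholds keep more data but admit noisier labels.
tradeoff = {t: retained_hours(segments, t) for t in (0.05, 0.20, 0.50)}
```

Training one model per threshold and comparing WER is then a direct way to measure the trade-off between data quantity and alignment accuracy.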

What would settle it

Human verification on a random sample of the aligned transcriptions showing high word error rates, or new ASR models trained on FalAR failing to produce measurable WER gains on independent European Portuguese test sets.
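
A human spot check of this kind could be organised roughly as follows: draw a reproducible random sample of segments, have annotators correct them, then report the sample WER with a bootstrap interval. This is a sketch under stated assumptions; the segment identifiers, sample size, and error counts are hypothetical.

```python
import random


def sample_segments(segment_ids, k, seed=0):
    """Reproducible random sample of k segment IDs for manual checking."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(segment_ids), k))


def bootstrap_wer_interval(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pooled WER.

    counts: list of (word_errors, reference_words) per verified segment,
    with reference_words > 0 for every segment.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(counts) for _ in counts]
        total_words = sum(w for _, w in resample)
        estimates.append(sum(e for e, _ in resample) / total_words)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical verification results, for illustration only.
to_check = sample_segments([f"seg{i:03d}" for i in range(20)], k=5, seed=1)
verified = [(1, 20), (0, 15), (4, 25), (2, 30), (3, 10)]
interval = bootstrap_wer_interval(verified, seed=0)
```

Resampling whole segments rather than individual words respects the fact that errors cluster within a segment, which usually gives a more honest interval.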

Figures

Figures reproduced from arXiv: 2605.27062 by Alberto Abad, Ben Peters, Carlos Carvalho, Catarina Botelho, Francisco Teixeira, Isabel Trancoso, Mariana Julião, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland.

Figure 1: FalAR data collection and processing pipeline.
Figure 2: FalAR demographic distribution. [...] and test splits for the subset of the corpus containing speaker-level information, to promote reproducible research with this corpus. These partitions correspond to FalAR_train.csv, FalAR_dev.csv, and FalAR_test.csv in [...]

Original abstract

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FalAR, a 5,800-hour speaker-annotated European Portuguese speech corpus drawn from 20 years of parliamentary sessions. It describes corpus construction via the CAMÕES ASR model for transcription-reference alignment, supplies speaker metadata (age, gender, affiliation, role) for 1,180 speakers across 4,850 hours, and reports ASR experiments in which pre-training on FalAR yields up to 14% relative WER improvement over baselines.

Significance. If the automatic alignments prove reliable, FalAR would constitute a substantial new resource for an under-resourced language variety, enabling both scale and speaker-conditioned modeling that current EP corpora lack. The empirical WER gains, if reproducible with full experimental controls, would directly demonstrate the corpus's downstream utility.

major comments (2)
  1. [§4] §4 (ASR experiments): the reported 'up to 14% relative WER improvement' is presented without baseline model specifications, training/validation/test splits, or error bars, preventing assessment of whether the gain is robust or attributable to the new data.
  2. [§3.2] §3.2 (alignment procedure): no WER, CER, or alignment-error figures are supplied for the CAMÕES model on any human-annotated subset of FalAR, leaving the central assumption that the pseudo-labels are sufficiently accurate for both corpus construction and pre-training untested.
minor comments (2)
  1. [Introduction] The abstract and introduction could more explicitly contrast FalAR with existing EP resources (e.g., size, speaker coverage, domain).
  2. [§3] Speaker metadata statistics (distribution of age/gender/affiliation) would benefit from a dedicated table or figure for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the manuscript's clarity and reproducibility.

Point-by-point responses
  1. Referee: [§4] §4 (ASR experiments): the reported 'up to 14% relative WER improvement' is presented without baseline model specifications, training/validation/test splits, or error bars, preventing assessment of whether the gain is robust or attributable to the new data.

    Authors: We agree that the experimental details are insufficient for full reproducibility and assessment. The revised manuscript will specify the baseline model architectures and hyperparameters, explicitly describe the train/validation/test splits (including how FalAR data was partitioned), and report error bars or results across multiple random seeds to quantify variability in the observed WER gains. revision: yes

  2. Referee: [§3.2] §3.2 (alignment procedure): no WER, CER, or alignment-error figures are supplied for the CAMÕES model on any human-annotated subset of FalAR, leaving the central assumption that the pseudo-labels are sufficiently accurate for both corpus construction and pre-training untested.

    Authors: The current version does not include quantitative alignment error metrics on a human-annotated subset. We will add a dedicated evaluation subsection reporting WER and CER of the CAMÕES model on any available human-annotated parliamentary data (or a newly annotated sample if feasible), along with a discussion of how alignment quality affects downstream pre-training utility. revision: yes
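
The error bars promised in the first response could be reported along the following lines: train each configuration with several random seeds, then summarise the WERs by mean and sample standard deviation. This is a hedged sketch; the function names, the two-standard-deviation rule of thumb, and the WER values are assumptions for illustration, not the authors' protocol or results.

```python
import statistics


def summarize_runs(wers):
    """Mean and sample standard deviation of WER across seeds."""
    if len(wers) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return statistics.mean(wers), statistics.stdev(wers)


def gain_exceeds_noise(baseline_wers, pretrained_wers, n_std=2.0):
    """Rough check: is the mean WER drop larger than n_std times the
    larger of the two run-to-run standard deviations?"""
    b_mean, b_std = summarize_runs(baseline_wers)
    p_mean, p_std = summarize_runs(pretrained_wers)
    return (b_mean - p_mean) > n_std * max(b_std, p_std)


# Hypothetical WERs (percent) from three seeds each, for illustration only.
baseline_runs = [10.0, 10.2, 9.8]
pretrained_runs = [8.6, 8.7, 8.5]
robust = gain_exceeds_noise(baseline_runs, pretrained_runs)
```

This threshold rule is only a coarse screen; with more seeds available, a proper significance test on the per-seed or per-utterance differences would be the stronger check.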

Circularity Check

0 steps flagged

Empirical ASR improvement measured on held-out data; no derivation reduces to inputs

full rationale

The paper's central claim is an empirical measurement: pre-training on FalAR yields up to 14% relative WER improvement over baselines. This is obtained by training ASR models on the new corpus (built via CAMÕES alignment) and evaluating WER on separate test sets. No equations, fitted parameters, or self-citations are invoked to derive the gain; the result is a direct experimental outcome rather than a quantity forced by the corpus construction. The CAMÕES labeling step is a data-generation choice whose quality is assumed but not part of any closed derivation loop within the paper. The reported trade-off experiments between quantity and alignment accuracy are likewise empirical comparisons, not self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical corpus construction paper; no mathematical derivations, fitted constants, or new postulated entities.
