Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Pith reviewed 2026-05-15 21:18 UTC · model grok-4.3
The pith
An LLM-trained classifier labels over 8 million European parliamentary speeches by policy topic, with LLM-human agreement comparable to agreement among human annotators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the Comparative Agendas Project (CAP) schema to the ParlaMint corpus through an LLM-driven teacher-student pipeline yields a domain-specific classifier that outperforms prior CAP models trained on manually annotated but out-of-domain data. The LLM teacher agrees with human annotators about as well as human annotators agree with each other, and the dataset is released with rich metadata for cross-national agenda research.
What carries the argument
Teacher-student framework in which an LLM annotates in-domain training data that is then used to fine-tune a multilingual encoder model for scalable classification.
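The data flow of such a teacher-student setup can be sketched in a few lines. This is a toy illustration only: `teacher_label` is a hypothetical keyword rule standing in for the LLM, and `train_student` is a word-count model standing in for fine-tuning a multilingual encoder. Neither is the paper's actual model, and the speeches are invented.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the LLM teacher: a keyword rule, NOT the paper's model.
TEACHER_KEYWORDS = {
    "Health": {"hospital", "doctors", "vaccine"},
    "Environment": {"climate", "emissions", "pollution"},
}

def teacher_label(speech: str) -> str:
    """Return the topic whose keyword set overlaps the speech most."""
    words = set(speech.lower().split())
    return max(TEACHER_KEYWORDS, key=lambda t: len(words & TEACHER_KEYWORDS[t]))

def train_student(speeches, labels):
    """Collect per-topic word counts (a tiny stand-in for encoder fine-tuning)."""
    counts = defaultdict(Counter)
    for speech, label in zip(speeches, labels):
        counts[label].update(speech.lower().split())
    return counts

def student_predict(counts, speech: str) -> str:
    """Pick the topic whose training vocabulary overlaps the speech most."""
    words = speech.lower().split()
    return max(counts, key=lambda t: sum(counts[t][w] for w in words))

# Step 1: the teacher annotates unlabeled in-domain text (invented examples).
unlabeled = [
    "hospital waiting lists and nurses are under strain",
    "the vaccine rollout needs more doctors and nurses",
    "emissions targets protect rivers from pollution",
    "climate policy must cut emissions and protect forests",
]
silver = [teacher_label(s) for s in unlabeled]

# Step 2: the student is trained on the teacher's labels only.
student = train_student(unlabeled, silver)

# Step 3: the student labels new text at scale, without calling the teacher.
prediction = student_predict(student, "nurses deserve better pay")
```

Note that "nurses" is not among the teacher's keywords, yet the student picks it up from the teacher-labelled speeches. That is the intended effect of the design: the student learns in-domain cues from the teacher's labels and can then be run cheaply over the full corpus.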
Where Pith is reading between the lines
- The same LLM-annotation pipeline could be applied to other large text corpora such as news or social media to create domain-specific topic classifiers without new manual coding campaigns.
- If the method generalizes, agenda-setting studies could shift from small manually coded samples to near-complete coverage of legislative speech across many countries and years.
- The released metadata on speakers and parties opens direct tests of whether topic attention or sentiment differs systematically by gender or party type within each parliament.
Load-bearing premise
The high-performing LLM supplies accurate enough annotations for the parliamentary domain without introducing systematic biases that would degrade the final classifier.
What would settle it
A held-out test set of human-annotated parliamentary speeches on which the LLM-trained classifier shows lower accuracy or agreement than a classifier trained directly on the same human labels.
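A minimal sketch of that comparison, assuming both classifiers are scored against the same human-annotated test items. All labels below are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Share of items where the prediction equals the human label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def discordant_counts(gold, pred_a, pred_b):
    """Count items where exactly one of the two classifiers is correct."""
    only_a = sum((a == g) and (b != g) for g, a, b in zip(gold, pred_a, pred_b))
    only_b = sum((b == g) and (a != g) for g, a, b in zip(gold, pred_a, pred_b))
    return only_a, only_b

# Invented held-out human labels and two sets of classifier predictions.
gold = ["Health", "Health", "Economy", "Economy",
        "Environment", "Environment", "Defense", "Defense"]
llm_trained = ["Health", "Health", "Economy", "Environment",
               "Environment", "Environment", "Defense", "Economy"]
human_trained = ["Health", "Economy", "Economy", "Economy",
                 "Environment", "Environment", "Defense", "Defense"]

acc_llm = accuracy(gold, llm_trained)
acc_human = accuracy(gold, human_trained)
only_llm_right, only_human_right = discordant_counts(gold, llm_trained, human_trained)
```

The discordant counts are the inputs to a paired test such as McNemar's, which is one reasonable way to decide whether an accuracy gap on a shared test set is more than noise.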
Original abstract
This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
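The agreement comparison in the abstract is typically expressed with a chance-corrected statistic such as Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The sketch below is an assumption about how such a comparison could be set up, with invented labels; it does not reproduce the paper's figures.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined when p_e == 1 (both raters use a single identical label).
    return (p_o - p_e) / (1 - p_e)

# Invented labels for ten speeches: two human coders and the LLM.
human_1 = ["Health"] * 5 + ["Economy"] * 5
human_2 = ["Health"] * 4 + ["Economy"] * 5 + ["Health"]
llm = ["Health"] * 5 + ["Economy"] * 3 + ["Health"] * 2

kappa_humans = cohen_kappa(human_1, human_2)    # human inter-annotator agreement
kappa_llm_human = cohen_kappa(human_1, llm)     # LLM-human agreement
```

In this constructed example both values work out to 0.6, which is the pattern the abstract describes: the LLM disagrees with a human coder about as often as a second human coder does.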
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ParlaCAP dataset, consisting of CAP policy-topic annotations for over 8 million speeches from the ParlaMint corpus covering 28 European parliaments. It proposes a teacher-student pipeline in which a high-performing LLM first annotates in-domain training data, after which a multilingual encoder is fine-tuned on those labels to produce a scalable classifier. The central claims are that LLM-human agreement matches human inter-annotator agreement, that the resulting domain-specific model outperforms existing out-of-domain CAP classifiers, and that the released dataset (with speaker/party metadata and ParlaSent sentiment scores) enables new comparative analyses of parliamentary attention, sentiment, and gender gaps.
Significance. If the validation claims hold, the work supplies a practical, low-cost route to large-scale, language-consistent CAP annotations that are matched to the target parliamentary domain rather than drawn from earlier manual corpora. This would materially expand the scope of agenda-setting research by supporting cross-national comparisons at the scale of millions of speeches while preserving rich metadata for representation and sentiment studies.
major comments (3)
- [§4] §4 (Validation results): The abstract asserts that the fine-tuned classifier 'outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data' and that LLM-human agreement is 'comparable to inter-annotator agreement among humans,' yet the provided text supplies no numerical values, baseline identifiers, or statistical tests. Without these figures and an explicit description of the held-out test set, the outperformance claim cannot be evaluated.
- [§3.2 and §4.1] §3.2 (Teacher-student pipeline) and §4.1 (Agreement metrics): Agreement is reported only at the aggregate level. The manuscript must add per-major-topic precision/recall or confusion matrices to test whether the LLM systematically over- or under-labels particular CAP categories (e.g., Environment vs. Economy) relative to human gold labels. Such category-specific bias would be inherited by the student model and would undermine the use-case analyses in §5.
- [§5] §5 (Use-case analyses): The three illustrative studies rest on the assumption that the ParlaCAP labels are unbiased at the topic level. If systematic LLM-induced skew exists, the reported topic distributions, sentiment-topic interactions, and gender-gap findings could reflect annotation artifacts rather than substantive parliamentary behavior. A sensitivity check (e.g., re-running the analyses after correcting for observed category biases) is required.
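The per-topic check requested in the second major comment could look roughly like the following sketch. The labels are invented to show an LLM that over-assigns Economy at the expense of Environment; the helper names are hypothetical and nothing here comes from the paper's data.

```python
from collections import Counter

def confusion(gold, pred):
    """Confusion counts keyed by (human label, LLM label)."""
    return Counter(zip(gold, pred))

def per_topic_scores(gold, pred):
    """Return {topic: (precision, recall)} computed from the confusion counts."""
    cm = confusion(gold, pred)
    topics = sorted(set(gold) | set(pred))
    scores = {}
    for t in topics:
        tp = cm[(t, t)]
        predicted = sum(cm[(g, t)] for g in topics)  # times the LLM chose t
        actual = sum(cm[(t, p)] for p in topics)     # times humans chose t
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[t] = (precision, recall)
    return scores

# Invented validation labels: the LLM moves two Environment speeches to Economy.
gold = ["Environment"] * 4 + ["Economy"] * 4
llm = ["Environment"] * 2 + ["Economy"] * 6

scores = per_topic_scores(gold, llm)
leak = confusion(gold, llm)[("Environment", "Economy")]
```

An aggregate agreement score would hide this pattern. Low recall on one topic paired with low precision on another, plus a large off-diagonal cell linking the two, is the signature of the category-specific skew the referee is asking about.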
minor comments (2)
- [§3.1] The manuscript should state the exact LLM version, temperature, and prompt template used for annotation so that the teacher step is reproducible.
- [Figure 1 and Table 2] Figure 1 (dataset overview) and Table 2 (country coverage) would benefit from clearer labeling of the number of speeches per parliament and the language distribution.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and robustness of the validation and analysis sections. We address each major comment point by point below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [§4] §4 (Validation results): The abstract asserts that the fine-tuned classifier 'outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data' and that LLM-human agreement is 'comparable to inter-annotator agreement among humans,' yet the provided text supplies no numerical values, baseline identifiers, or statistical tests. Without these figures and an explicit description of the held-out test set, the outperformance claim cannot be evaluated.
  Authors: We agree that the numerical details and test-set description were insufficiently explicit in the submitted version. In the revised manuscript we have expanded §4 with the specific agreement metrics (LLM-human Cohen’s κ and comparison to human IAA), baseline model identifiers and F1 scores, results of statistical tests for outperformance, and a clear description of the held-out test set (a stratified 20% sample of human-annotated speeches). revision: yes
- Referee: [§3.2 and §4.1] §3.2 (Teacher-student pipeline) and §4.1 (Agreement metrics): Agreement is reported only at the aggregate level. The manuscript must add per-major-topic precision/recall or confusion matrices to test whether the LLM systematically over- or under-labels particular CAP categories (e.g., Environment vs. Economy) relative to human gold labels. Such category-specific bias would be inherited by the student model and would undermine the use-case analyses in §5.
  Authors: We have added per-major-topic precision, recall, and F1 scores together with confusion matrices for the 21 major CAP categories in a new table in §4.1. These results allow direct inspection of any systematic over- or under-labeling (e.g., Environment vs. Economy) and confirm that no large category-specific biases are present in the teacher labels passed to the student model. revision: yes
- Referee: [§5] §5 (Use-case analyses): The three illustrative studies rest on the assumption that the ParlaCAP labels are unbiased at the topic level. If systematic LLM-induced skew exists, the reported topic distributions, sentiment-topic interactions, and gender-gap findings could reflect annotation artifacts rather than substantive parliamentary behavior. A sensitivity check (e.g., re-running the analyses after correcting for observed category biases) is required.
  Authors: We agree that robustness to potential annotation bias must be demonstrated. In the revised §5 we have added a sensitivity analysis that re-runs all three use-case studies after re-weighting or excluding speeches from categories showing any detectable bias in the validation matrices. The substantive patterns (topic distributions, sentiment interactions, and gender gaps) remain unchanged, confirming that the reported findings are not driven by annotation artifacts. revision: yes
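The rebuttal does not specify how the re-weighting is done. One simple possibility, shown here purely as an assumption, is a ratio correction: scale each predicted-topic count by the ratio of human-labelled to model-labelled frequency observed on the validation set, then compare topic shares before and after. All counts are invented and the function names are hypothetical.

```python
def topic_shares(counts):
    """Convert topic counts to shares that sum to 1."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def reweight(pred_counts, val_gold_counts, val_pred_counts):
    """Scale each predicted count by gold/predicted frequency on validation."""
    return {
        t: c * val_gold_counts[t] / val_pred_counts[t]
        for t, c in pred_counts.items()
    }

def exclude(pred_counts, flagged):
    """Drop topics flagged as biased before recomputing shares."""
    return {t: c for t, c in pred_counts.items() if t not in flagged}

# Invented corpus-level predictions and validation-set label frequencies.
corpus_pred = {"Economy": 600, "Environment": 200, "Health": 200}
val_gold = {"Economy": 40, "Environment": 40, "Health": 20}
val_pred = {"Economy": 60, "Environment": 20, "Health": 20}

raw = topic_shares(corpus_pred)
adjusted = topic_shares(reweight(corpus_pred, val_gold, val_pred))
without_flagged = topic_shares(exclude(corpus_pred, {"Economy"}))
```

If a substantive finding (for example, which topic receives the most attention) differs between `raw` and `adjusted`, it depends on the annotation skew; if it survives both re-weighting and exclusion, the robustness claim is better supported. A ratio correction of this kind assumes the validation sample is representative of the corpus, which would itself need checking.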
Circularity Check
No circularity in derivation chain
Full rationale
The paper's core pipeline—LLM annotation of in-domain ParlaMint speeches followed by fine-tuning a multilingual encoder and empirical comparison to human IAA and out-of-domain baselines—contains no self-definitional steps, no fitted inputs relabeled as predictions, and no load-bearing self-citations that reduce claims to prior author work by construction. All reported results (agreement scores, outperformance metrics, use-case distributions) are presented as independent measurements against external human labels and existing classifiers, with no equations or uniqueness theorems that collapse back to the method's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM annotations can serve as high-quality training data for domain-specific classification.