arxiv: 2604.21897 · v1 · submitted 2026-04-23 · 💻 cs.CL · cs.CY

Recognition: unknown

Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Fl\'avio Soriano , Victoria F. Mello , Pedro B. Rigueira , Gisele L. Pappa , Wagner Meira Jr. , Ana Paula Couto da Silva , Jussara M. Almeida

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:32 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords parliamentary discoursestylometric analysistopic modelingsemantic clusteringBrazilian politicslegislative behaviorcomputational social sciencepolitical speech

0 comments

The pith

Computational analysis of over 450,000 Brazilian parliamentary speeches shows a shift to shorter rhetoric, crisis-driven agenda changes, and alignments based more on region and gender than party.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a computational framework to study parliamentary discourse along three dimensions: how things are said, what is said, and which speakers align discursively. It applies this to a large corpus of Brazilian Chamber of Deputies speeches spanning 2003 to 2025, tracking stylistic evolution, topic shifts, and semantic groupings. Results indicate speeches have grown shorter and more direct, legislative topics reorient sharply with national crises, and regional or gender identities often outweigh formal party ties in determining who speaks similarly. This approach matters because standard legislative studies rely primarily on voting records and therefore miss the semantic and rhetorical content that shapes actual lawmaking. The work positions discourse analysis as a multidimensional complement to vote-based methods.

Core claim

The authors establish that a scalable framework combining diachronic stylometric analysis, contextual topic modeling, and semantic clustering applied to over 450,000 speeches from the Brazilian Chamber of Deputies uncovers a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation.

What carries the argument

A scalable computational framework that combines diachronic stylometric analysis to track changes in speech style, contextual topic modeling to capture agenda shifts, and semantic clustering to map discursive similarities among deputies.

Load-bearing premise

The diachronic stylometric analysis, contextual topic modeling, and semantic clustering accurately capture rhetorical and semantic content without substantial bias from model choices, preprocessing, or corpus selection.

What would settle it

If an independent manual annotation of a random sample of the speeches or an alternative set of models and preprocessing choices fails to recover the same long-term shortening trend, crisis-linked topic reorientations, and region/gender-dominant clusters, the central results would not hold.

Figures

Figures reproduced from arXiv: 2604.21897 by Ana Paula Couto da Silva, Fl\'avio Soriano, Gisele L. Pappa, Jussara M. Almeida, Pedro B. Rigueira, Victoria F. Mello, Wagner Meira Jr..

**Figure 1.** Figure 1: The multi-faceted analytical pipeline, from data retrieval to the three parallel analysis fronts: thematic, semantic, and [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Evolution of stylometric (top row) and grammatical (bottom row) metrics of parliamentary discourse (2003-2024). [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Temporal evolution of the discourse for 6 macro-themes (2003-2025). [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: A timeline visualizing the dominant theme in parliamentary discourse (2003-2025). [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: 2D visualization of the centroids of the 49 iden [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Standard NLP tools applied at scale to Brazilian legislative speeches yield plausible patterns on style, crises, and identity, but add little methodological innovation.

read the letter

The paper takes three standard NLP approaches—stylometric tracking of speech length and directness, contextual topic modeling, and embedding-based clustering—and runs them on a large corpus of Brazilian deputy speeches. It reports a trend toward briefer, plainer language over two decades, topic shifts aligned with crises like economic downturns or pandemics, and clusters where regional and gender factors often group speakers more tightly than party membership.

Referee Report

0 major / 2 minor

Summary. The paper introduces a scalable computational framework combining diachronic stylometric analysis, contextual topic modeling, and semantic clustering to examine parliamentary discourse. Applied to a corpus of over 450,000 speeches from the Brazilian Chamber of Deputies (2003–2025), it claims to document a long-term shift toward shorter and more direct speeches, sharp reorientations in the legislative agenda during national crises, and a map of discursive alignments in which regional and gender identities are often more salient than formal party affiliation.

Significance. If the results hold, this work is significant for offering a multi-dimensional complement to vote-based legislative studies by incorporating rhetorical and semantic content at scale. The use of a large corpus, standard documented pipelines, and basic validation steps such as coherence scores and silhouette metrics are explicit strengths that support generalizability to other parliaments. The findings on identity saliency and crisis responsiveness could inform computational political science and encourage similar multi-faceted analyses elsewhere.

minor comments (2)

The abstract states the headline results without referencing the validation metrics (coherence scores, silhouette metrics) used in the three core analyses; a brief clause noting these steps would improve transparency and align the summary with the methods section.
Several figures depicting topic evolution over time and the embedding-based clusters would benefit from additional axis labels, legends, or annotations to enhance clarity and allow readers to assess the claimed patterns without ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive summary of our manuscript, as well as the recommendation for minor revision. We are pleased that the significance of the multi-faceted computational framework, the scale of the 450k-speech corpus, and the key findings on stylistic simplification, crisis-driven agenda shifts, and the relative salience of regional/gender identities over party affiliation have been recognized. We will make the necessary minor revisions to improve clarity and presentation.

Circularity Check

0 steps flagged

No significant circularity; empirical application of standard tools

full rationale

The manuscript presents an empirical case study applying three established, externally documented NLP pipelines (diachronic stylometrics, contextual topic modeling, and embedding-based semantic clustering) to an independent corpus of 450k+ speeches. No equations, parameters, or predictions are derived from the target results themselves; all methods are standard (with reported coherence/silhouette validation) and the claims rest on observable patterns in the data rather than self-referential definitions or fitted quantities renamed as predictions. Self-citations, if any, are non-load-bearing and do not substitute for the core analyses.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework implicitly assumes standard NLP preprocessing and model validity for political text.

pith-pipeline@v0.9.0 · 5512 in / 1103 out tokens · 82458 ms · 2026-05-09T21:32:25.087221+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 2 canonical work pages · 2 internal anchors

[1]

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Optuna: A Next-generation Hyperparameter Opti- mization Framework. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discov- ery and Data Mining. Baptista, L.; Mooney, J.; and de Faria, P. 2021. The political- ideological spectrum in the Brazilian Senate: A text analy- sis of senators’ speeches (2015-2018).PLOS ONE, 16(6): e0251MT...

work page internal anchor Pith review arXiv 2021
[2]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Congressional-record: A parser for the Congressional Record. Kim, J.; Lee, S.; Kwon, J.; Gu, S.; Kim, Y .; Cho, M.; yong Sohn, J.; and Choi, C. 2024. Linq-Embed-Mistral:Elevating Text Retrieval with Improved GPT Data Through Task- Specific Control and Quality Refinement. Linq AI Research Blog. Laver, M.; Benoit, K.; and Garry, J. 2003. Extracting Policy P...

work page internal anchor Pith review arXiv 2024
[3]

For most authors... (a) Would answering this research question advance sci- ence without violating social contracts, such as violat- ing privacy norms, perpetuating unfair profiling, exac- erbating the socio-economic divide, or implying disre- spect to societies or cultures? Yes (b) Do your main claims in the abstract and introduction accurately reflect t...
[4]

Additionally, if your study involves hypotheses testing... (a) Did you clearly state the assumptions underlying all theoretical results? Yes (b) Have you provided justifications for all theoretical re- sults? Yes (c) Did you discuss competing hypotheses or theories that might challenge or complement your theoretical re- sults? Yes (d) Have you considered ...
[5]

(a) Did you state the full set of assumptions of all theoret- ical results? NA (b) Did you include complete proofs of all theoretical re- sults? NA

Additionally, if you are including theoretical proofs... (a) Did you state the full set of assumptions of all theoret- ical results? NA (b) Did you include complete proofs of all theoretical re- sults? NA
[6]

Additionally, if you ran machine learning experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (ei- ther in the supplemental material or as a URL)? Yes (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Yes (c) Did you report error bars (...
[7]

(a) If your work uses existing assets, did you cite the cre- ators? Yes (b) Did you mention the license of the assets? No

Additionally, if you are using existing assets (e.g., code, data, models) or curating/releasing new assets,without compromising anonymity... (a) If your work uses existing assets, did you cite the cre- ators? Yes (b) Did you mention the license of the assets? No. The assets used are public records (c) Did you include any new assets in the supplemental mat...
[8]

Additionally, if you used crowdsourcing or conducted research with human subjects,without compromising anonymity... (a) Did you include the full text of instructions given to participants and screenshots? NA (b) Did you describe any potential participant risks, with mentions of Institutional Review Board (IRB) ap- provals? NA (c) Did you include the estim...
[9]

A list of exactly six macro-themes
[10]

1You are assisting with the thematic analysis of parliamentary speeches from the Brazilian Chamber of Deputies

For each macro-theme: 25- label 26- short description Listing 2: Prompt used in Stage 2 for contextual lexical expansion. 1You are assisting with the thematic analysis of parliamentary speeches from the Brazilian Chamber of Deputies. 2 3You will receive:
[11]

A list of macro-themes previously defined for the analysis
[12]

An initial manually defined set of seed keywords for each macro-theme
[13]

A list of granular BERTopic topics with representative keywords. 7 8Task: 9For each macro-theme, expand the initial seed keyword list with additional terms that are contextually related and likely to co-occur with the seed keywords in Brazilian parliamentary discourse. 10 11Instructions: 12- Preserve the meaning and scope of each macro-theme. 13- Suggest ...

2024