
arxiv: 2605.15886 · v1 · pith:C57JNEOYnew · submitted 2026-05-15 · 💻 cs.CL

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

Pith reviewed 2026-05-20 18:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords Russian politics · political speeches · multimodal dataset · topic modeling · multilingual corpus · political communication · authoritarian regimes · data resource

The pith

A new dataset links decades of Russian government speeches to images, translations, and topics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents two large collections of official speeches from the Kremlin and Russian Ministry of Foreign Affairs, each with Russian and English texts, available images and captions, and harmonized details such as dates, speakers, and locations. Unique identifiers connect the images to the speeches and match the two language versions of each text. Topical labels for both the words and the pictures were created with transformer models and checked by a specialist in Russian politics. A sympathetic reader would care because the resource fills a longstanding shortage of structured, multimodal data on how authoritarian states communicate and supplies a ready testbed for studying political language with both traditional methods and large language models.

Core claim

This paper introduces a dataset of interlinked multimodal political communications from the Russian government. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech it supplies Russian- and English-language texts, associated images and captions where available, and harmonized metadata including dates, speakers, locations, and official tags. Unique identifiers link images to speeches and align the Russian and English versions. The collections are further augmented with validated topical annotations for both speech texts and speech images, generated via transformer-based multimodal topic modeling and refined by a Russian politics expert.

What carries the argument

Unique identifiers that link images to specific speeches while aligning Russian and English versions, combined with transformer-generated multimodal topic annotations refined by expert review.
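As a minimal sketch of how such identifiers can be used, the snippet below groups language versions and image files under a shared speech ID. The field names (`speech_id`, `lang`, `image_file`) and all records are hypothetical and are not taken from the released data; the paper's actual schema may differ.

```python
from collections import defaultdict

# Hypothetical records; the field names and values are illustrative only.
speeches = [
    {"speech_id": "k-001", "lang": "ru", "title": "Заседание Совета Безопасности"},
    {"speech_id": "k-001", "lang": "en", "title": "Security Council meeting"},
    {"speech_id": "k-002", "lang": "ru", "title": "Пресс-конференция"},
]
images = [
    {"speech_id": "k-001", "image_file": "k-001/img_01.jpg"},
    {"speech_id": "k-001", "image_file": "k-001/img_02.jpg"},
]

def link_records(speeches, images):
    """Group language versions and image files under their shared speech ID."""
    linked = defaultdict(lambda: {"texts": {}, "images": []})
    for row in speeches:
        linked[row["speech_id"]]["texts"][row["lang"]] = row["title"]
    for row in images:
        linked[row["speech_id"]]["images"].append(row["image_file"])
    return dict(linked)

linked = link_records(speeches, images)
# IDs that have a Russian text but no English counterpart
missing_en = sorted(sid for sid, rec in linked.items() if "en" not in rec["texts"])
```

Keying everything on one identifier keeps the cross-lingual and image joins independent, so a speech without images or without a translation still appears in the linked view.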

If this is right

  • Enables combined analysis of textual content and visual elements in the same communications.
  • Supports direct comparison of Russian and English versions of official statements.
  • Allows tracking of themes across time and geographic locations in domestic and foreign policy.
  • Supplies a ready testbed for applying large language models to real political texts and images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linking approach could be reused to build comparable resources for speeches from other governments.
  • Differences in how topics appear in text versus images might reveal strategies for shaping domestic versus international audiences.
  • The dataset could be used to test whether models trained on it better detect shifts in official messaging during key events.

Load-bearing premise

The topical annotations and the links between images, speeches, and language versions are accurate and reliable.

What would settle it

A spot-check that finds many mismatched image-speech pairs or topic labels that systematically disagree with independent expert judgment would show the dataset cannot reliably support the claimed analyses.
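One way such a spot-check could be set up is sketched below: draw a reproducible random sample of image-speech pairs, have a human judge each pair, and report the mismatch rate with a Wilson score interval. The sample size, the error count, and the function names are assumptions for illustration; the paper does not describe this procedure.

```python
import math
import random

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("sample size must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def draw_spot_check(pair_ids, k, seed=0):
    """Draw a reproducible simple random sample of image-speech pairs to audit."""
    rng = random.Random(seed)
    return rng.sample(pair_ids, k)

# Hypothetical audit: 200 sampled pairs, 6 judged mismatched by a human checker.
sample = draw_spot_check(list(range(10_000)), k=200)
low, high = wilson_interval(errors=6, n=200)
```

The interval, rather than the raw rate, is what indicates whether "many" mismatches is a fair description at a given sample size.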

Figures

Figures reproduced from arXiv: 2605.15886 by Benjamin E. Bagozzi, Daria Blinova, Gayathri Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Rakesh Emuru, Sunita Chandrasekaran.

Figure 1: Two-stage webscraping workflow. For each source (Kremlin, MID) and language (Russian, English), an index builder first traverses the site listings and writes an index CSV of speech IDs and URLs. A page fetcher-parser then consumes this index, downloads each page, extracts structured text and metadata into a speech-level CSV, and saves all associated images into per-ID folders.
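The two-stage workflow in the Figure 1 caption can be sketched roughly as follows. This is an offline illustration, not the authors' scraper: the `fetch` and `parse` callables, the URLs, and the page format are stand-ins, and image saving is omitted.

```python
import csv
import io

def build_index(listing, out):
    """Stage 1: write an index CSV of speech IDs and URLs from a site listing."""
    writer = csv.DictWriter(out, fieldnames=["speech_id", "url"])
    writer.writeheader()
    for speech_id, url in listing:
        writer.writerow({"speech_id": speech_id, "url": url})

def fetch_and_parse(index_file, fetch, parse):
    """Stage 2: read the index, fetch each page, and return speech-level rows."""
    rows = []
    for entry in csv.DictReader(index_file):
        page = fetch(entry["url"])   # a network call in a real scraper
        record = parse(page)         # extract text and metadata
        record["speech_id"] = entry["speech_id"]
        rows.append(record)
    return rows

# Stand-ins so the sketch runs offline; the URL and page content are made up.
fake_site = {"https://example.org/s/1": "2020-01-15|Opening remarks"}
listing = [("s-1", "https://example.org/s/1")]

index_csv = io.StringIO()
build_index(listing, index_csv)
index_csv.seek(0)

rows = fetch_and_parse(
    index_csv,
    fetch=fake_site.__getitem__,
    parse=lambda page: dict(zip(["date", "text"], page.split("|", 1))),
)
```

Separating the index from the fetch step means a failed download can be retried from the index without re-traversing the site listings.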
Figure 2: Cross-lingual linkage within a source. For each source (Kremlin or MID), Russian …
Figure 3: Kremlin corpus: coverage and basic descriptive statistics for speeches and images across Russian- and English-language versions of the site. Figures 3a–3d summarize the coverage and basic structure of our scraped Kremlin speech corpus.
Figure 4: MID.RU corpus: coverage and basic descriptive statistics for speeches and images …
Figure 5: High-level overview of the topic modeling pipeline. We embed Kremlin and MID speech texts (EN and RU→EN) with sentence-transformer models and fit BERTopic⁴⁸ separately per corpus. In parallel, associated images are embedded with CLIP¹⁶ (ViT-B/32¹⁶) and scored against topic prompts to assign image-topic labels. Final curated topic IDs, labels, and groups are saved for speeches and images.
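The image-labeling step in the Figure 5 caption, scoring an image embedding against topic prompts, can be illustrated with plain cosine similarity. The three-dimensional vectors and the topic labels below are toy stand-ins for CLIP image and text embeddings; no model is loaded, and the paper's actual scoring rule may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_topic(image_vec, prompt_vecs):
    """Return the topic label whose prompt embedding is closest to the image."""
    scores = {label: cosine(image_vec, vec) for label, vec in prompt_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy 3-d vectors standing in for CLIP image and text embeddings.
prompt_vecs = {
    "military parade": [0.9, 0.1, 0.0],
    "diplomatic meeting": [0.1, 0.9, 0.1],
}
label, scores = assign_topic([0.2, 0.8, 0.1], prompt_vecs)
```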
Figure 6: K-sweep scree plot for the Kremlin English corpus. Lines show normalized topic-quality metrics (coherence c_npmi, diversity, compactness, and separation) and their weighted composite score; the selected solution is K = 89 topics.
Figure 7: K-sweep scree plot for the MID English corpus. Lines show normalized topic-quality metrics (coherence c_npmi, diversity, compactness, and separation) and their weighted composite score; the selected solution is K = 32 topics. To select these target topic counts, we first performed model-selection diagnostics on the native-English corpora (Kremlin EN and MID EN). For each corpus, we began from a single high-…
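The K-sweep selection described in the Figure 6 and Figure 7 captions, normalized metrics combined into a weighted composite, might look like the sketch below. The metric values, the weights, min-max normalization, and the assumption that every metric is "higher is better" are all illustrative choices; the captions do not state the paper's weights or normalization.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_k(sweep, weights):
    """Return the K whose weighted sum of normalized metrics is largest."""
    ks = sorted(sweep)
    composite = {k: 0.0 for k in ks}
    for metric, weight in weights.items():
        normalized = min_max([sweep[k][metric] for k in ks])
        for k, value in zip(ks, normalized):
            composite[k] += weight * value
    best = max(ks, key=lambda k: composite[k])
    return best, composite

# Made-up metric values for three candidate topic counts.
sweep = {
    20: {"coherence": 0.10, "diversity": 0.90, "compactness": 0.40, "separation": 0.30},
    40: {"coherence": 0.18, "diversity": 0.80, "compactness": 0.55, "separation": 0.50},
    80: {"coherence": 0.15, "diversity": 0.60, "compactness": 0.60, "separation": 0.45},
}
# Assumed weights, chosen only for the example.
weights = {"coherence": 0.4, "diversity": 0.2, "compactness": 0.2, "separation": 0.2}

best_k, scores = select_k(sweep, weights)
```

Normalizing each metric across the sweep before weighting keeps a metric with a large raw range from dominating the composite.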
Original abstract

This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces a dataset of interlinked multimodal political communications from the Russian government, comprising two large corpora of official speeches by senior Kremlin and Ministry of Foreign Affairs actors over multiple decades. For each speech it provides Russian- and English-language texts, associated images and captions where available, harmonized metadata (dates, speakers, locations, official tags), unique identifiers linking images to speeches and aligning language versions, and topical annotations for both texts and images generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The authors claim the resulting resources support multimodal, multilingual, temporal, and spatial analyses of authoritarian political communication and offer a valuable testbed for social science research and LLM applications in political domains.

Significance. If the linking procedures and topical annotations prove reliable, the dataset would address a genuine gap in available multimodal and multilingual data for authoritarian politics contexts and could enable new empirical work on political communication as well as serve as a testbed for LLM evaluation in domain-specific settings. The provision of harmonized metadata and cross-language alignment is a concrete strength that would facilitate temporal and spatial analyses.

major comments (2)
  1. The manuscript provides no quantitative validation for the transformer-based multimodal topic annotations (e.g., topic coherence scores, held-out perplexity, or inter-rater agreement between model output and expert refinements) and no details on model architecture, multimodal alignment procedure, or training regime. This directly weakens the central claim that the annotations are accurate and reliable enough to support the asserted analyses and testbed uses (see Abstract and Dataset Description sections).
  2. No information is given on data collection procedures, potential selection biases in speech or image inclusion, or error rates in the expert refinement process. These omissions make it impossible to assess whether the described resources actually support the claimed uses for social science and LLM research.
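For the agreement statistic mentioned in the first comment, a chance-corrected measure such as Cohen's kappa between model-assigned and expert-refined labels is one option. The sketch below uses made-up labels for ten speeches; it illustrates the kind of diagnostic the referee asks for and is not a result from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels: model output versus expert-refined labels for ten speeches.
model = ["econ", "econ", "war", "war", "econ", "diplo", "diplo", "war", "econ", "diplo"]
expert = ["econ", "econ", "war", "diplo", "econ", "diplo", "diplo", "war", "war", "diplo"]
kappa = cohens_kappa(model, expert)
```

In this toy case the raw agreement is 0.8, while kappa is lower because some agreement is expected by chance given the label frequencies.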

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify how the manuscript can better support the dataset's intended uses in political communication research and LLM applications. We respond to each major comment below and indicate the revisions we will make.

  1. Referee: The manuscript provides no quantitative validation for the transformer-based multimodal topic annotations (e.g., topic coherence scores, held-out perplexity, or inter-rater agreement between model output and expert refinements) and no details on model architecture, multimodal alignment procedure, or training regime. This directly weakens the central claim that the annotations are accurate and reliable enough to support the asserted analyses and testbed uses (see Abstract and Dataset Description sections).

    Authors: We agree that the current version lacks explicit quantitative validation metrics and technical details on the modeling pipeline. The annotations were generated with a standard transformer-based multimodal topic model followed by expert refinement from a Russian politics specialist. In the revised manuscript we will add a new subsection under Dataset Description that specifies the model family and architecture, the procedure used for multimodal alignment of text and image features, training regime and hyperparameters, and available quantitative diagnostics such as topic coherence scores. We will also report the criteria and scope of the expert refinement step. This addition will directly address the concern about reliability for downstream analyses.

    revision: yes

  2. Referee: No information is given on data collection procedures, potential selection biases in speech or image inclusion, or error rates in the expert refinement process. These omissions make it impossible to assess whether the described resources actually support the claimed uses for social science and LLM research.

    Authors: We accept that the manuscript would benefit from greater transparency on these points. The speeches and images were harvested from official Russian government portals and archives covering the stated time period and actors; we will expand the Data Collection subsection to describe the scraping and filtering pipeline, the criteria for inclusion of images and captions, and a discussion of likely selection biases (for example, the emphasis on publicly released official content). For the expert refinement step we will document the protocol used and note that formal error-rate quantification was not performed; instead we will describe the qualitative checks applied and any remaining limitations. These clarifications will be incorporated in the revised version.

    revision: yes

Circularity Check

0 steps flagged

Dataset introduction paper with no derivations, predictions, or modeling claims exhibits no circularity.


This paper introduces a linked multimodal dataset of Russian government speeches, including texts, images, metadata, and topical annotations generated via transformer-based multimodal topic modeling then refined by an expert. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text or abstract. The central claim is simply that the resulting resource supports various analyses and serves as a testbed; this does not reduce to any input by construction, self-citation chain, or renamed known result. The contribution is self-contained as a data release rather than a closed-form result or statistical prediction that could be circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a data resource paper whose contribution rests on collection, linking, and annotation of government communications rather than new theoretical constructs; the primary assumptions concern the representativeness of the collected speeches and the validity of the AI-generated plus expert-refined annotations.

axioms (1)
  • domain assumption: Transformer-based multimodal topic modeling produces accurate topical annotations for political speech texts and images when refined by a domain expert.
    Invoked to augment the linked datasets with validated topical annotations for both texts and images.

pith-pipeline@v0.9.0 · 5737 in / 1269 out tokens · 67460 ms · 2026-05-20T18:39:32.957697+00:00 · methodology



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 3 internal anchors

  1. [1]

    Grimmer, J. & Stewart, B. M. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Polit. Analysis 21, 267–297 (2013)

  2. [2]

    Wilkerson, J. & Casas, A. Large-scale computerized text analysis in political science: Opportunities and challenges. Annu. Rev. Polit. Sci. 20, 529–544 (2017)

  3. [3]

    Mueller, H. & Rauh, C. Reading between the lines: Prediction of political violence using newspaper text. Am. Polit. Sci. Rev. 112, 358–375 (2018)

  4. [4]

    Benoit, K., Munger, K. & Spirling, A. Measuring and explaining political sophistication through textual complexity. Am. J. Polit. Sci. 63, 491–508 (2019)

  5. [5]

    Torres, M. A framework for the unsupervised and semi-supervised analysis of visual frames. Polit. Analysis 32, 199–220 (2024)

  6. [6]

    Vishwanath, A. Race, legislative speech, and symbolic representation in congress. Am. J. Polit. Sci. 69, 578–593 (2025)

    Steinert-Threlkeld, Z. C. The future of event data is images. Sociol. Methodol. 49, 68–75 (2019)

  7. [7]

    Bonikowski, B. & Nelson, L. K. From ends to means: The promise of computational text analysis for theoretically driven sociological research. Sociol. Methods & Res. 51, 1469–1483 (2022)

  8. [8]

    Casas, A. & Williams, N. W. Introduction to the special issue on images as data. Comput. Commun. Res. 4 (2022)

  9. [9]

    Birkenmaier, L., Lechner, C. M. & Wagner, C. The search for solid ground in text as data: A systematic review of validation practices and practical recommendations for validation. Commun. methods measures 18, 249–277 (2024)

  10. [10]

    Li, H. & Zhang, N. Computer vision models for image analysis in advertising research. J. Advert. 53, 771–790 (2024)

  11. [11]

    Shahgholian, A., Odacioglu, E., Zhang, L. & Allmendinger, R. Big textual data research for operations management: Topic modeling with grounded theory. Int. J. Oper. Prod. Manag. (2023)

  12. [12]

    Paulus, A., Rohr, M., Dotsch, R. & Wentura, D. Positive feeling, negative meaning: Visualizing the mental representations of in-group and out-group smiles. PloS one 11, e0151230 (2016)

  13. [13]

    Bittermann, A. & Fischer, A. Natural language processing in psychology. Zeitschrift für Psychol. 232, 143–146, 10.1027/2151-2604/a000568 (2024)

  14. [14]

    Mahajan, D. et al. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  15. [15]

    Radford, A. et al. Learning transferable visual models from natural language supervision. In Meila, M. & Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research, 8748–8763 (PMLR, 2021)

  16. [16]

    Xu, H. et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In International Conference on Machine Learning, 38728–38748 (PMLR, 2023)

  17. [17]

    Kress, G. & van Leeuwen, T. Multimodal Discourse: The Modes and Media of Contemporary Communication (Arnold, London, 2001)

  18. [18]

    Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis Mach. Intell. 41, 423–443 (2019)

  19. [19]

    Pineda, A. J. Capturing Political Communication Online Using Image and Text Data: A Deep Learning Approach. Ph.D. thesis, The University of Michigan, Ann Arbor, MI (2023). 10.7302/7501. Doctoral dissertation in Political Science and Scientific Computing

  20. [20]

    Liu, D. & Shao, L. Nationalist propaganda and support for war in an authoritarian context: Evidence from china. J. Peace Res. 61, 985–1001 (2024)

  21. [21]

    Hollyer, J. R., Rosendorff, B. P. & Vreeland, J. R. Democracy and transparency. The J. Polit. 73, 1191–1205 (2011)

  22. [22]

    Wallace, J. L. Juking the stats? authoritarian information problems in china. Br. J. Polit. Sci. 46, 11–29, 10.1017/S0007123414000106 (2016)

  23. [23]

    Rozenas, A. & Stukal, D. How autocrats manipulate economic news: Evidence from russia’s state-controlled television. The J. Polit. 81, 982–996 (2019)

    Carroll, J. Image and imitation the visual rhetoric of pro-russian propaganda. Ideol. Polit. J. 2, 36–79 (2017)

  24. [24]

    Roberts, M. E. Censored: Distraction and Diversion Inside China’s Great Firewall (Princeton University Press, 2018)

  25. [25]

    Mochtak, M. Chasing the authoritarian spectre: Detecting authoritarian discourse with large language models. Eur. J. Polit. Res. (2025)

  26. [26]

    La Lova, L. Text-as-data methods to study mass-media manipulations in autocracies. Communist Post-Communist Stud. 1–17 (2025)

  27. [27]

    Zhong, W., Chen, B., Liang, F. & Zhang, M. M. Picturing protest: Visual framing in authoritarian media on twitter. Digit. Journalism 0, 1–22 (2025)

  28. [29]

    Mölder, M. & Berg, E. Conflicts and shifts in the kremlin’s political discourse since the start of the putin presidency (2000–2019). Eur. Stud. 75, 564–582 (2023)

  29. [30]

    Blinova, D. Priming with fear: Putin’s manipulation of domestic public support. Russ. Polit. 10, 121–164 (2025)

  30. [31]

    Yavuz, M. Crises and ideological change in authoritarian regimes: Evidence from the july 2016 coup attempt in turkey. Comp. Polit. Stud. 00104140251369324 (2025)

  31. [32]

    Weiss, J. C. Authoritarian signaling, mass audiences, and nationalist protest in china. Int. Organ. 67, 1–35, 10.1017/S0020818312000380 (2013)

  32. [33]

    Weiss, J. C. & Dafoe, A. Authoritarian audiences, rhetoric, and propaganda in international crises: Evidence from china. Int. Stud. Q. 63, 963–973 (2019)

    Dai, Y. & Luqiu, L. R. Wolf warriors and diplomacy in the new era. China Rev. 22, 253–283 (2022)

  33. [34]

    Liu, M., Yan, J. & Yao, G. Themes and ideologies in china’s diplomatic discourse-a corpus-assisted discourse analysis in china’s official speeches. Front. Psychol. 14, 1278240 (2023)

  34. [35]

    Mochtak, M. & Turcsanyi, R. Q. Studying chinese foreign policy narratives: Introducing the ministry of foreign affairs press conferences corpus. J. Chin. Polit. Sci. 26, 743–761 (2021)

    O’Brien, S. P. Anticipating the good, the bad, and the ugly: An early warning approach to conflict and instability analysis. J. conflict resolution 46, 791–811 (2002)

  35. [36]

    Blair, R. A. & Sambanis, N. Forecasting civil wars: Theory and structure in an age of “big data” and machine learning. J. Confl. Resolut. 64, 1885–1915 (2020)

  36. [37]

    D’Orazio, V. Conflict forecasting and prediction. In Oxford Research Encyclopedia of International Studies (Oxford University Press, 2020)

    Python Software Foundation. Python 3 documentation. https://docs.python.org/3/. Accessed 2026-01-04

  37. [38]

    Reitz, K. & contributors, R. Requests: Http for humans. https://pypi.org/project/requests/ (2025). Python package. Version 2.32.5 (released Aug 18, 2025). Accessed Jan 4, 2026

  38. [39]

    Richardson, L. & Contributors, B. S. Beautiful soup documentation (software). https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Accessed 2026-01-04

  39. [40]

    McKinney, W. Data structures for statistical computing in python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference (SciPy 2010), 56–61 (2010)

  40. [41]

    National Institute of Standards and Technology. Fips pub 180-4: Secure hash standard (shs). https://csrc.nist.gov/publications/detail/fips/180/4/final (2015). Accessed 2026-01-04

  41. [42]

    Tech, A. O. & Contributors. Argos translate (software). https://github.com/argosopentech/argos-translate. Accessed 2026-01-04

  42. [43]

    Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794 (2022)

  43. [44]

    Campello, R. J. G. B., Moulavi, D. & Sander, J. Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining (PAKDD) (2013)

    Google. Google colaboratory documentation. https://colab.research.google.com/. Accessed 2026-01-04

  44. [45]

    Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362, 10.1038/s41586-020-2649-2 (2020)

  45. [46]

    Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS) (2019)

  46. [47]

    Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (2020)

  47. [48]

    Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2019)

  48. [49]

    McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (2018)

  49. [50]

    Qi, P., Zhang, Y., Zhang, Y., Bolton, J. & Manning, C. D. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) (2020)

  50. [51]

    Honnibal, M., Montani, I., Van Landeghem, S. & Boyd, A. spacy: Industrial-strength natural language processing in python, 10.5281/zenodo.1212303 (2020)

  51. [52]

    Korobov, M. & Contributors, p. pymorphy3: Russian morphological analyzer (software). https://github.com/no-plagiarism/pymorphy3. Accessed 2026-01-04

  52. [53]

    Clark, A. & Contributors, P. Pillow: The friendly PIL fork (software). https://python-pillow.org/. Accessed 2026-01-04

  53. [54]

    Varoquaux, G. & Contributors, j. joblib: Computing with python functions (software). https://joblib.readthedocs.io/. Accessed 2026-01-04

  54. [55]

    Apache Parquet Contributors. Apache parquet: Columnar storage format. https://parquet.apache.org/. Accessed 2026-01-04

  55. [56]

    Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.-Y. MPNet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems (NeurIPS) (2020)

  56. [57]

    Bagozzi, B. E. The multifaceted nature of global climate change negotiations. The Rev. Int. Organ. 10, 439–464 (2015)

  57. [58]

    Berliner, D., Bagozzi, B. E., Palmer-Rubin, B. & Erlich, A. The political logic of government disclosure: Evidence from information requests in mexico. The J. Polit. 83, 229–245 (2021)

    Wang, L. et al. Text embeddings by weakly-supervised contrastive pre-training. arXiv:2212.03533 (2022)

  58. [59]

    Feng, F., Yang, Y., Cer, D., Arivazhagan, N. & Wang, W. Language-agnostic BERT sentence embedding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) (2020)

  59. [60]

    Beijing Academy of Artificial Intelligence (BAAI) and Contributors. BGE-M3: Multilingual, multi-granularity text embeddings (software/model). https://github.com/FlagOpen/FlagEmbedding. Accessed 2026-01-04

  60. [61]

    Anthropic. Claude 3 haiku model documentation (claude-3-haiku-20240307). https://docs.anthropic.com/. Accessed 2026-01-04

  61. [62]

    geopy contributors. geopy: Geocoding library for python. https://geopy.readthedocs.io/ (2024). Accessed 2025-09-09

  62. [63]

    OpenStreetMap contributors. Nominatim: Openstreetmap geocoding. https://nominatim.org/ (2024). Accessed 2025-09-09

  63. [64]

    OpenStreetMap contributors. Openstreetmap. https://www.openstreetmap.org (2024). Data and services used via Nominatim; Accessed 2025-09-09

  64. [65]

    Esri. Arcgis world geocoding service documentation. https://developers.arcgis.com/rest/geocode/api-reference/overview-world-geocoding-service.htm. Accessed 2026-01-04

    OpenAI. Introducing chatgpt. https://openai.com/index/chatgpt/ (2022). Accessed: 2026-01-05

  65. [66]

    Erlich, A., Dantas, S. G., Bagozzi, B. E., Berliner, D. & Palmer-Rubin, B. Multi-label prediction for political text-as-data. Polit. Analysis 30, 463–480, 10.1017/pan.2021.15 (2022)

  66. [67]

    Blinova, D. et al. Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches, 10.7910/DVN/SGI0VK (2026)

    Acknowledgments This work was supported in part by the National Science Foundation under Award No. 2417814, SCIPE: Building a Computational and Data-Intensive Research Workforce & Network in the Mid-Atlantic Region (Strengthening the Cyber...