
arxiv: 2604.03247 · v1 · submitted 2026-03-10 · cs.CY · cs.AI · cs.CL · cs.SI

Classifying Problem and Solution Framing in Congressional Social Media

Pith reviewed 2026-05-15 13:38 UTC · model grok-4.3

classification: cs.CY, cs.AI, cs.CL, cs.SI
keywords: problem framing, solution framing, congressional tweets, BERTweet, text classification, Garbage Can model, US Senators, social media analysis

The pith

A BERTweet model classifies US Senators' tweets as problem-focused, solution-focused, or other with weighted F1 above 0.8.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers trained a machine learning model to automatically label tweets from US Senators as focusing on problems, solutions, or neither. This draws on the Garbage Can model of policy making, which separates problem identification from solution processes. Experts labeled 3,967 tweets from a larger set of 1.68 million, and a BERTweet model was fine-tuned on this data. The model reached an average weighted F1 score above 0.8 in cross-validation across the three categories. This automated approach makes it possible to analyze framing patterns across the entire corpus of congressional social media posts.

Core claim

Using supervised learning on a labeled subset of 3967 tweets, a BERTweet Base model classifies US Senator tweets into problem, solution, or other categories with an average weighted F1 score above 0.8 on cross validation.

What carries the argument

A fine-tuned BERTweet Base model for three-class text classification of tweets into problem, solution, or other.
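The headline metric can be made concrete with a small sketch. The function below computes a support-weighted F1 over the three classes in plain Python; the function name `weighted_f1` and the toy labels are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter

LABELS = ("problem", "solution", "other")

def weighted_f1(y_true, y_pred, labels=LABELS):
    """Mean of per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score

# Toy labels, invented for illustration only.
y_true = ["problem", "problem", "solution", "solution", "other", "other"]
y_pred = ["problem", "solution", "solution", "solution", "other", "problem"]
toy_score = weighted_f1(y_true, y_pred)
```

Weighting by support means a class with few examples moves the score less than a common one, so a weighted F1 above 0.8 does not by itself show that every class is classified well.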

If this is right

  • The full 1.68 million tweets can now be labeled at scale to measure overall problem versus solution emphasis in senatorial posts.
  • Temporal tracking becomes possible to observe shifts in framing around legislative events or elections.
  • Comparisons across individual senators or political parties can quantify differences in their use of problem and solution language.
  • The classifier supports direct tests of the Garbage Can model's problem-solution distinction in digital policy communication.
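As a rough sketch of the comparisons listed above, predicted labels can be tallied into per-group shares. The helper `label_shares` and the toy rows are hypothetical; the group key could equally be a (party, month) pair for temporal tracking.

```python
from collections import Counter, defaultdict

def label_shares(rows):
    """rows: iterable of (group, label) pairs. Returns {group: {label: share}}."""
    counts = defaultdict(Counter)
    for group, label in rows:
        counts[group][label] += 1
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }

# Toy rows, invented for illustration; not results from the paper.
rows = [
    ("D", "problem"), ("D", "problem"), ("D", "solution"), ("D", "other"),
    ("R", "problem"), ("R", "solution"), ("R", "solution"), ("R", "other"),
]
shares = label_shares(rows)
```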

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same approach could extend to other platforms or elected officials to compare framing styles across contexts.
  • Pairing the labels with topic models might show which problems are most often linked to specific solutions in the tweets.
  • If the classifier holds, it could enable studies linking tweet framing to public engagement metrics like replies or retweets.

Load-bearing premise

Expert labels on the 3967 tweets are accurate and representative of the full 1.68 million tweet corpus without major annotation errors or selection bias.

What would settle it

Labeling a new independent sample of 500 tweets from the same corpus with the same expert criteria and finding the model's weighted F1 on that sample falls below 0.7.
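A minimal sketch of how such a check might be set up, assuming tweets are addressable by integer ID. The names `draw_fresh_sample` and `claim_survives`, and the stand-in ID ranges, are assumptions made for illustration.

```python
import random

def draw_fresh_sample(corpus_ids, labeled_ids, k=500, seed=0):
    """Uniformly sample k tweet IDs that are not in the labeled subset."""
    labeled = set(labeled_ids)
    pool = [i for i in corpus_ids if i not in labeled]
    return random.Random(seed).sample(pool, k)

def claim_survives(weighted_f1_on_fresh_sample, threshold=0.7):
    """The test above counts against the claim when F1 falls below the threshold."""
    return weighted_f1_on_fresh_sample >= threshold

# Stand-in IDs; the real corpus has about 1.68 million tweets.
corpus_ids = range(10_000)
labeled_ids = range(0, 10_000, 3)
fresh = draw_fresh_sample(corpus_ids, labeled_ids)
```

Excluding the already labeled IDs keeps the new sample independent of anything the model saw during training or model selection.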

Figures

Figures reproduced from arXiv: 2604.03247 by A. Michael Tjhin, Annelise Russell, Blake VanBerlo, Jesse Hoey, Joshua D. Elkind, Michelle M. Buehlmann, Misha Melnyk, Mitchell Dolny, Saisha Chebium.

Figure 1: Labeled data set partition used for training and evaluation. Size of subsets in graph …
Figure 2: BERT Model Details showing flow of data from initial text to output scores. In …
Figure 3: Tweets per month by party
Figure 4: Distribution of labels per month for each party.
Figure 5: Distribution of labels per month by gender and race.
Original abstract

Policy setting in the USA according to the "Garbage Can" model differentiates between "problem" and "solution" focused processes. In this paper, we study a large dataset of US Senator postings on Twitter (1.68m tweets in total). Our objective is to develop an automated method to label Senatorial posts as either in the problem or solution streams. Two academic policy experts labeled a subset of 3967 tweets as either problem, solution, or other (anything not problem or solution). We split off a subset of 500 tweets into a test set, with the remaining 3467 used for training. During development, this training set was further split by 60/20/20 proportions for fitting, validation, and development test sets. We investigated supervised learning methods for building problem/solution classifiers directly on the training set, evaluating their performance in terms of F1 score on the validation set, allowing us to rapidly iterate through models and hyperparameters, achieving an average weighted F1 score of above 0.8 on cross validation across the three categories using a BERTweet Base model.
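The partition described in the abstract can be sketched as follows. This is an illustration under stated assumptions (a seeded uniform shuffle and floor rounding); the function name `split_labeled` is hypothetical, and the paper's exact subset sizes may differ from the ones this rounding produces.

```python
import random

def split_labeled(items, n_test=500, fractions=(0.6, 0.2), seed=0):
    """Hold out n_test items, then split the rest into fit/validation/dev-test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    test, train = items[:n_test], items[n_test:]
    n_fit = int(fractions[0] * len(train))
    n_val = int(fractions[1] * len(train))
    fit = train[:n_fit]
    val = train[n_fit:n_fit + n_val]
    dev_test = train[n_fit + n_val:]  # remainder, roughly the last 20%
    return fit, val, dev_test, test

# Integer stand-ins for the 3967 labeled tweets.
fit, val, dev_test, test = split_labeled(range(3967))
sizes = [len(fit), len(val), len(dev_test), len(test)]
# With floor rounding this gives 2080 / 693 / 694 / 500.
```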

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to classify US Senators' tweets (from a 1.68m corpus) into problem, solution, or other framing categories per the Garbage Can model of policy processes. Two experts labeled a 3967-tweet subset; a BERTweet Base model fine-tuned on 3467 tweets (via 60/20/20 splits) achieves average weighted F1 >0.8 on cross-validation, with a 500-tweet held-out test set.

Significance. If the expert labels prove reliable, the work offers a scalable tool for large-scale analysis of problem versus solution framing in congressional social media, directly supporting empirical tests of policy stream models. The reported F1 performance on a transformer baseline is a concrete, reproducible starting point for such studies.

major comments (2)
  1. [Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.
  2. [Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.
minor comments (1)
  1. [Abstract] Abstract: The description of the train/validation/test split ('60/20/20 proportions for fitting, validation, and development test sets') should be clarified as to whether cross-validation refers to k-fold or repeated hold-out, to avoid ambiguity with the reported 'cross validation' F1.
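The agreement statistic requested in major comment 1 can be sketched in a few lines. This is a plain-Python version of Cohen's kappa for two annotators; the function name and the toy labels are illustrative and are not taken from the paper's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)

# Toy annotations, invented for illustration only.
expert_1 = ["problem", "problem", "solution", "other"]
expert_2 = ["problem", "solution", "solution", "other"]
kappa = cohens_kappa(expert_1, expert_2)
```

In this toy case the experts agree on 3 of 4 tweets (0.75 raw agreement), but kappa is lower because some of that agreement is expected by chance.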

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of transparency in our data and annotation process. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.

    Authors: We agree that inter-annotator agreement metrics are essential to substantiate label reliability. The two experts labeled the 3967 tweets independently, with disagreements resolved through discussion to reach consensus. We will add Cohen's kappa, percentage agreement, and a description of the adjudication process to the revised Data and Annotation section.
    Revision: yes

  2. Referee: [Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.

    Authors: The 3967 tweets were obtained via random sampling from the full 1.68m corpus to support representativeness. We will explicitly describe this sampling procedure in the revised Data and Annotation section to allow evaluation of potential bias.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the supervised classification pipeline


The paper trains a BERTweet model on expert-provided labels for 3967 tweets and reports weighted F1 >0.8 on cross-validation splits of that data plus a held-out test set. The performance metric is computed directly against the independent human labels on unseen examples; it is not reduced to the training inputs by construction, nor does any equation or self-citation chain equate the reported F1 to a fitted parameter or prior result. The derivation remains self-contained as a standard supervised evaluation against external annotations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert annotations provide reliable ground truth and that the pre-trained BERTweet model can be fine-tuned effectively on this domain-specific text.

free parameters (1)
  • BERTweet fine-tuning hyperparameters
    Learning rate, batch size, and epochs are chosen during training on the labeled tweets.
axioms (1)
  • domain assumption: Expert labels on 3967 tweets are accurate ground truth for the problem/solution/other categories
    Invoked when training the supervised classifier on the labeled subset.



