Classifying Problem and Solution Framing in Congressional Social Media

A. Michael Tjhin; Annelise Russell; Blake VanBerlo; Jesse Hoey; Joshua D. Elkind; Michelle M. Buehlmann; Misha Melnyk; Mitchell Dolny; Saisha Chebium

arxiv: 2604.03247 · v1 · submitted 2026-03-10 · 💻 cs.CY · cs.AI· cs.CL· cs.SI

Classifying Problem and Solution Framing in Congressional Social Media

Misha Melnyk , Mitchell Dolny , Joshua D. Elkind , A. Michael Tjhin , Saisha Chebium , Blake VanBerlo , Annelise Russell , Michelle M. Buehlmann

show 1 more author

Jesse Hoey

This is my paper

Pith reviewed 2026-05-15 13:38 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.CLcs.SI

keywords problem framingsolution framingcongressional tweetsBERTweettext classificationGarbage Can modelUS Senatorssocial media analysis

0 comments

The pith

A BERTweet model classifies US Senators' tweets as problem-focused, solution-focused, or other with weighted F1 above 0.8.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers trained a machine learning model to automatically label tweets from US Senators as focusing on problems, solutions, or neither. This draws on the Garbage Can model of policy making, which separates problem identification from solution processes. Experts labeled 3,967 tweets from a larger set of 1.68 million, and a BERTweet model was fine-tuned on this data. The model reached an average weighted F1 score above 0.8 in cross-validation across the three categories. This automated approach makes it possible to analyze framing patterns across the entire corpus of congressional social media posts.

Core claim

Using supervised learning on a labeled subset of 3967 tweets, a BERTweet Base model classifies US Senator tweets into problem, solution, or other categories with an average weighted F1 score above 0.8 on cross validation.

What carries the argument

A fine-tuned BERTweet Base model for three-class text classification of tweets into problem, solution, or other.

If this is right

The full 1.68 million tweets can now be labeled at scale to measure overall problem versus solution emphasis in senatorial posts.
Temporal tracking becomes possible to observe shifts in framing around legislative events or elections.
Comparisons across individual senators or political parties can quantify differences in their use of problem and solution language.
The classifier supports direct tests of the Garbage Can model's problem-solution distinction in digital policy communication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same approach could extend to other platforms or elected officials to compare framing styles across contexts.
Pairing the labels with topic models might show which problems are most often linked to specific solutions in the tweets.
If the classifier holds, it could enable studies linking tweet framing to public engagement metrics like replies or retweets.

Load-bearing premise

Expert labels on the 3967 tweets are accurate and representative of the full 1.68 million tweet corpus without major annotation errors or selection bias.

What would settle it

Labeling a new independent sample of 500 tweets from the same corpus with the same expert criteria and finding the model's weighted F1 on that sample falls below 0.7.

Figures

Figures reproduced from arXiv: 2604.03247 by A. Michael Tjhin, Annelise Russell, Blake VanBerlo, Jesse Hoey, Joshua D. Elkind, Michelle M. Buehlmann, Misha Melnyk, Mitchell Dolny, Saisha Chebium.

**Figure 2.** Figure 2: BERT Model Details showing flow of data from initial text to output scores. In [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Tweets per month by party [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of labels per month for each party. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of labels per month by gender and race. [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

read the original abstract

Policy setting in the USA according to the ``Garbage Can'' model differentiates between ``problem'' and ``solution'' focused processes. In this paper, we study a large dataset of US Senator postings on Twitter (1.68m tweets in total). Our objective is to develop an automated method to label Senatorial posts as either in the problem or solution streams. Two academic policy experts labeled a subset of 3967 tweets as either problem, solution, or other (anything not problem or solution). We split off a subset of 500 tweets into a test set, with the remaining 3467 used for training. During development, this training set was further split by 60/20/20 proportions for fitting, validation, and development test sets. We investigated supervised learning methods for building problem/solution classifiers directly on the training set, evaluating their performance in terms of F1 score on the validation set, allowing us to rapidly iterate through models and hyperparameters, achieving an average weighted F1 score of above 0.8 on cross validation across the three categories using a BERTweet Base model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BERTweet classifier for problem vs solution framing in senator tweets hits F1>0.8 but rests on unverified expert labels.

read the letter

The paper fine-tunes BERTweet Base on 3967 expert-labeled senator tweets to sort them into problem framing, solution framing, or other, then reports weighted F1 above 0.8 on cross-validation. They drew the labels from a 1.68 million tweet corpus and used straightforward 60/20/20 splits plus a held-out test set of 500. That is the core result. The task itself is new for this exact domain and model combination, and the authors show the classifier is usable at scale for tracking the garbage-can distinction in congressional Twitter. The data handling and model iteration steps are described plainly enough that someone could replicate the pipeline. The main weakness is the missing inter-annotator agreement for the two experts and the lack of any sampling details for how the 3967 tweets were chosen from the full set. Without those, the F1 number is harder to trust as evidence of reliable framing detection rather than label noise. No error analysis appears in the abstract either. This work is aimed at computational social scientists who need a practical starting point for large-scale political text analysis on social media. A reader already studying agenda setting or framing would find the model and labeled set useful as a baseline, though they would likely re-check the annotations themselves. It is solid enough to send to peer review; referees can ask for the agreement stats and sampling description without the paper falling apart.

Referee Report

2 major / 1 minor

Summary. The paper claims to classify US Senators' tweets (from a 1.68m corpus) into problem, solution, or other framing categories per the Garbage Can model of policy processes. Two experts labeled a 3967-tweet subset; a BERTweet Base model fine-tuned on 3467 tweets (via 60/20/20 splits) achieves average weighted F1 >0.8 on cross-validation, with a 500-tweet held-out test set.

Significance. If the expert labels prove reliable, the work offers a scalable tool for large-scale analysis of problem versus solution framing in congressional social media, directly supporting empirical tests of policy stream models. The reported F1 performance on a transformer baseline is a concrete, reproducible starting point for such studies.

major comments (2)

[Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.
[Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.

minor comments (1)

[Abstract] Abstract: The description of the train/validation/test split ('60/20/20 proportions for fitting, validation, and development test sets') should be clarified as to whether cross-validation refers to k-fold or repeated hold-out, to avoid ambiguity with the reported 'cross validation' F1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of transparency in our data and annotation process. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.

Authors: We agree that inter-annotator agreement metrics are essential to substantiate label reliability. The two experts labeled the 3967 tweets independently, with disagreements resolved through discussion to reach consensus. We will add Cohen's kappa, percentage agreement, and a description of the adjudication process to the revised Data and Annotation section. revision: yes
Referee: [Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.

Authors: The 3967 tweets were obtained via random sampling from the full 1.68m corpus to support representativeness. We will explicitly describe this sampling procedure in the revised Data and Annotation section to allow evaluation of potential bias. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the supervised classification pipeline

full rationale

The paper trains a BERTweet model on expert-provided labels for 3967 tweets and reports weighted F1 >0.8 on cross-validation splits of that data plus a held-out test set. The performance metric is computed directly against the independent human labels on unseen examples; it is not reduced to the training inputs by construction, nor does any equation or self-citation chain equate the reported F1 to a fitted parameter or prior result. The derivation remains self-contained as a standard supervised evaluation against external annotations.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that expert annotations provide reliable ground truth and that the pre-trained BERTweet model can be fine-tuned effectively on this domain-specific text.

free parameters (1)

BERTweet fine-tuning hyperparameters
Learning rate, batch size, and epochs are chosen during training on the labeled tweets.

axioms (1)

domain assumption Expert labels on 3967 tweets are accurate ground truth for problem/solution/other categories
Invoked when training the supervised classifier on the labeled subset.

pith-pipeline@v0.9.0 · 5530 in / 1081 out tokens · 37832 ms · 2026-05-15T13:38:34.194013+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

achieving an average weighted F1 score of above 0.8 on cross validation across the three categories using a BERTweet Base model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 3 internal anchors

[1]

Bishop, C. M. (2006). Pattern Recognition and Machine Learning . Springer

work page 2006
[2]

D., March, J

Cohen, M. D., March, J. G., and Olsen, J. P. (1972). A garbage can model of organizational choice. Administrative Science Quarterly , 17(1):1--25

work page 1972
[3]

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR , abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Frisli, S. (2025). Semi-supervised self-training for covid-19 misinformation detection: analyzing twitter data and alternative news media on norwegian twitter. Journal of Computational Social Science , 8(39)

work page 2025
[5]

Kingdon, J. W. (2003). Agendas, Alternatives, and Public Policies . Addison-Wesley

work page 2003
[6]

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[7]

and Koch, G

Landis, J. and Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics , 33(1):159--174

work page 1977
[8]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. (2017). Fixing weight decay regularization in adam. CoRR , abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

and Gurevych, I

Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics

work page 2019
[10]

T. Chen, T. and C. Guestrin, C. (2016). Xgboost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , page 785–794

work page 2016

[1] [1]

Bishop, C. M. (2006). Pattern Recognition and Machine Learning . Springer

work page 2006

[2] [2]

D., March, J

Cohen, M. D., March, J. G., and Olsen, J. P. (1972). A garbage can model of organizational choice. Administrative Science Quarterly , 17(1):1--25

work page 1972

[3] [3]

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR , abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Frisli, S. (2025). Semi-supervised self-training for covid-19 misinformation detection: analyzing twitter data and alternative news media on norwegian twitter. Journal of Computational Social Science , 8(39)

work page 2025

[5] [5]

Kingdon, J. W. (2003). Agendas, Alternatives, and Public Policies . Addison-Wesley

work page 2003

[6] [6]

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014

[7] [7]

and Koch, G

Landis, J. and Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics , 33(1):159--174

work page 1977

[8] [8]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. (2017). Fixing weight decay regularization in adam. CoRR , abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

and Gurevych, I

Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics

work page 2019

[10] [10]

T. Chen, T. and C. Guestrin, C. (2016). Xgboost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , page 785–794

work page 2016