Classifying Problem and Solution Framing in Congressional Social Media
Pith reviewed 2026-05-15 13:38 UTC · model grok-4.3
The pith
A BERTweet model classifies US Senators' tweets as problem-focused, solution-focused, or other with weighted F1 above 0.8.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A BERTweet Base model, trained with supervised learning on expert-labeled tweets (3967 labeled in total, 500 of them held out as a test set), classifies US Senator tweets as problem, solution, or other, with an average weighted F1 score above 0.8 on cross-validation.
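For readers unfamiliar with the metric, weighted F1 is the mean of the per-class F1 scores, each weighted by that class's share of the true labels. The sketch below is a minimal pure-Python illustration; the function name `weighted_f1` and the toy labels are assumptions made for this example and do not come from the paper.

```python
from collections import Counter

LABELS = ("problem", "solution", "other")

def weighted_f1(y_true, y_pred, labels=LABELS):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += support[label] / total * f1
    return score

# Toy labels invented for illustration only (not data from the paper).
y_true = ["problem", "problem", "solution", "other", "other", "other"]
y_pred = ["problem", "solution", "solution", "other", "other", "problem"]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, a large "other" class can dominate the score, which is one reason per-class F1 values are worth reading alongside the weighted average.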
What carries the argument
A fine-tuned BERTweet Base model for three-class text classification of tweets into problem, solution, or other.
If this is right
- The full 1.68 million tweets can now be labeled at scale to measure overall problem versus solution emphasis in senatorial posts.
- Temporal tracking becomes possible to observe shifts in framing around legislative events or elections.
- Comparisons across individual senators or political parties can quantify differences in their use of problem and solution language.
- The classifier supports direct tests of the Garbage Can model's problem-solution distinction in digital policy communication.
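Once tweets carry predicted labels, the comparisons listed above reduce to computing label shares per group. The following is a hypothetical sketch: the record fields (`party`, `label`), the group names, and the function `label_shares` are assumptions for illustration, and the toy records are invented rather than drawn from the corpus.

```python
from collections import Counter, defaultdict

LABELS = ("problem", "solution", "other")

def label_shares(records, group_key, labels=LABELS):
    """Return, per group, the fraction of tweets carrying each label."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record["label"]] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {label: counter[label] / total for label in labels}
    return shares

# Invented toy records; "A" and "B" are placeholder group names.
records = [
    {"party": "A", "label": "problem"},
    {"party": "A", "label": "problem"},
    {"party": "A", "label": "solution"},
    {"party": "A", "label": "other"},
    {"party": "B", "label": "solution"},
    {"party": "B", "label": "solution"},
]
shares = label_shares(records, "party")
```

The same function would work for temporal tracking if `group_key` pointed at a month or session field instead of a party field.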
Where Pith is reading between the lines
- The same approach could extend to other platforms or elected officials to compare framing styles across contexts.
- Pairing the labels with topic models might show which problems are most often linked to specific solutions in the tweets.
- If the classifier holds, it could enable studies linking tweet framing to public engagement metrics like replies or retweets.
Load-bearing premise
Expert labels on the 3967 tweets are accurate and representative of the full 1.68 million tweet corpus without major annotation errors or selection bias.
What would settle it
Labeling a new independent sample of 500 tweets from the same corpus with the same expert criteria and finding the model's weighted F1 on that sample falls below 0.7.
Original abstract
Policy setting in the USA according to the "Garbage Can" model differentiates between "problem" and "solution" focused processes. In this paper, we study a large dataset of US Senator postings on Twitter (1.68m tweets in total). Our objective is to develop an automated method to label Senatorial posts as either in the problem or solution streams. Two academic policy experts labeled a subset of 3967 tweets as either problem, solution, or other (anything not problem or solution). We split off a subset of 500 tweets into a test set, with the remaining 3467 used for training. During development, this training set was further split by 60/20/20 proportions for fitting, validation, and development test sets. We investigated supervised learning methods for building problem/solution classifiers directly on the training set, evaluating their performance in terms of F1 score on the validation set, allowing us to rapidly iterate through models and hyperparameters, achieving an average weighted F1 score of above 0.8 on cross validation across the three categories using a BERTweet Base model.
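The split arithmetic in the abstract can be made concrete with a short sketch. The abstract does not say how the splits were drawn, so the seeded shuffle, the truncating rounding, and the function name `split_indices` are all assumptions made for illustration; the authors may have used stratification or a different rounding rule.

```python
import random

def split_indices(n_total=3967, n_test=500, fractions=(0.6, 0.2, 0.2), seed=0):
    """Hold out a test set, then split the remainder for fitting/validation/dev-test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_total))
    rng.shuffle(indices)
    test = indices[:n_test]
    train_pool = indices[n_test:]  # 3467 tweets with the default arguments
    n_fit = int(fractions[0] * len(train_pool))
    n_val = int(fractions[1] * len(train_pool))
    fit = train_pool[:n_fit]
    val = train_pool[n_fit:n_fit + n_val]
    dev_test = train_pool[n_fit + n_val:]  # remainder absorbs rounding leftovers
    return fit, val, dev_test, test

fit, val, dev_test, test = split_indices()
sizes = (len(fit), len(val), len(dev_test), len(test))
# sizes == (2080, 693, 694, 500) under the rounding assumed here
```

Under these assumptions, roughly 2080 tweets are available for fitting in any single split, which is a small training set for a three-class transformer classifier.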
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to classify US Senators' tweets (from a 1.68m corpus) into problem, solution, or other framing categories per the Garbage Can model of policy processes. Two experts labeled a 3967-tweet subset; a BERTweet Base model fine-tuned on 3467 tweets (via 60/20/20 splits) achieves average weighted F1 >0.8 on cross-validation, with a 500-tweet held-out test set.
Significance. If the expert labels prove reliable, the work offers a scalable tool for large-scale analysis of problem versus solution framing in congressional social media, directly supporting empirical tests of policy stream models. The reported F1 performance on a transformer baseline is a concrete, reproducible starting point for such studies.
major comments (2)
- [Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.
- [Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.
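The agreement statistics requested in the first major comment are straightforward to compute once both annotators' independent labels are available. The sketch below shows Cohen's kappa for two annotators in pure Python; the function name `cohens_kappa` and the toy label lists are assumptions for illustration and are not the paper's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented toy annotations (p = problem, s = solution, o = other).
annotator_1 = ["p", "p", "s", "s", "o", "o", "o", "o"]
annotator_2 = ["p", "s", "s", "s", "o", "o", "o", "p"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the raw percentage agreement is 0.75 while kappa is about 0.62, which illustrates why the two statistics are usually reported together.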
minor comments (1)
- [Abstract] Abstract: The description of the train/validation/test split ('60/20/20 proportions for fitting, validation, and development test sets') should be clarified as to whether cross-validation refers to k-fold or repeated hold-out, to avoid ambiguity with the reported 'cross validation' F1.
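The distinction raised in the minor comment can be illustrated with index generators for the two schemes. This is a sketch only: the abstract does not state which scheme was used, and the fold count, repeat count, and function names below are assumptions chosen for the example.

```python
import random

def k_fold_splits(n, k=5, seed=0):
    """k-fold: every index lands in exactly one validation fold."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, val))
    return splits

def repeated_holdout_splits(n, val_fraction=0.2, repeats=5, seed=0):
    """Repeated hold-out: each repeat draws a fresh random validation set."""
    rng = random.Random(seed)
    n_val = int(val_fraction * n)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((idx[n_val:], idx[:n_val]))  # (train, validation)
    return splits

kf = k_fold_splits(3467, k=5)
rh = repeated_holdout_splits(3467, val_fraction=0.2, repeats=5)
```

With k-fold, each tweet is validated on exactly once, whereas repeated hold-out may validate on some tweets several times and on others never, so an "average" F1 means something slightly different under each scheme.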
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of transparency in our data and annotation process. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Data and Annotation] Data and Annotation section: No inter-annotator agreement statistic (e.g., Cohen's kappa or percentage agreement) is reported for the two experts' labels on the 3967 tweets, nor is there any description of how disagreements were adjudicated. Because the weighted F1 >0.8 claim is computed directly against these labels, the absence of reliability metrics is load-bearing for the central performance result.
  Authors: We agree that inter-annotator agreement metrics are essential to substantiate label reliability. The two experts labeled the 3967 tweets independently, with disagreements resolved through discussion to reach consensus. We will add Cohen's kappa, percentage agreement, and a description of the adjudication process to the revised Data and Annotation section.
  Revision: yes
- Referee: [Data and Annotation] Data and Annotation section: The manuscript provides no details on how the 3967 tweets were sampled from the full 1.68m corpus (e.g., random, stratified, keyword-based). Without this, it is impossible to evaluate selection bias or representativeness, which directly affects whether the F1 score generalizes to the full dataset.
  Authors: The 3967 tweets were obtained via random sampling from the full 1.68m corpus to support representativeness. We will explicitly describe this sampling procedure in the revised Data and Annotation section to allow evaluation of potential bias.
  Revision: yes
Circularity Check
No significant circularity in the supervised classification pipeline
The paper trains a BERTweet model on expert-provided labels and reports weighted F1 >0.8 on cross-validation within the 3467-tweet training set, with a further 500 labeled tweets set aside as a held-out test set. The performance metric is computed against independent human labels on examples the model did not see during fitting; it is not reduced to the training inputs by construction, nor does any equation or self-citation chain equate the reported F1 to a fitted parameter or prior result. The derivation remains self-contained as a standard supervised evaluation against external annotations.
Axiom & Free-Parameter Ledger
free parameters (1)
- BERTweet fine-tuning hyperparameters
axioms (1)
- domain assumption: Expert labels on the 3967 tweets are accurate ground truth for the problem/solution/other categories
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "achieving an average weighted F1 score of above 0.8 on cross validation across the three categories using a BERTweet Base model"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.