TWEETQA: A Social Media Focused Question Answering Dataset
Pith reviewed 2026-05-24 21:22 UTC · model grok-4.3
The pith
TweetQA is the first large-scale dataset for question answering over tweets, revealing that even fine-tuned BERT lags human performance significantly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first large-scale dataset for QA over social media data by collecting tweets used by journalists to write news articles and having annotators create questions and abstractive answers on them. Two recently proposed neural models perform poorly on this dataset compared to formal text, and even fine-tuned BERT lags behind human performance by a large margin.
What carries the argument
The TweetQA dataset, built from journalist-sourced tweets with abstractive QA pairs, used as a benchmark to demonstrate limitations of existing QA models on social media text.
If this is right
- QA systems for real-time knowledge from social media will require new approaches beyond those for news and Wikipedia.
- Models must handle abstractive answers rather than just extractive spans.
- The dataset provides a testbed to develop and evaluate social media specific QA techniques.
- Performance gaps indicate that informal language and noise in tweets pose unique challenges.
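To make the extractive/abstractive distinction in the list above concrete, here is a minimal Python sketch. It is an illustration only, not the paper's evaluation code: the tokenizer, the helper names `is_extractive` and `token_f1`, and the example tweet are all hypothetical, and token-level F1 is used here merely as a simple stand-in for an overlap-based answer metric.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_extractive(answer, passage):
    """Return True if the answer's tokens occur as one contiguous run in the passage."""
    a, p = tokenize(answer), tokenize(passage)
    if not a:
        return False
    return any(p[i:i + len(a)] == a for i in range(len(p) - len(a) + 1))


def token_f1(prediction, reference):
    """Bag-of-tokens F1 between a predicted answer and a reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tweet, invented for illustration (not taken from the dataset).
tweet = "Can't believe the game got rained out again, see u all tmrw #bummed"

span_answer = "rained out"                       # copied straight from the tweet
free_answer = "it was cancelled because of rain"  # paraphrase, not a span

print(is_extractive(span_answer, tweet))  # True
print(is_extractive(free_answer, tweet))  # False
print(token_f1("it rained", "rained out"))  # 0.5
```

A span-selection model can only ever produce answers of the first kind, so a reference answer that fails `is_extractive` is out of its reach by construction, whatever metric is used to score it.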
Where Pith is reading between the lines
- Extending the dataset to other social media platforms could reveal if Twitter-specific features drive the difficulty.
- Training models with more social media data might close the performance gap with humans.
- Applications like automated news summarization or event detection could use such QA systems if improved.
Load-bearing premise
That tweets selected because journalists used them to write news articles form a representative and useful sample for general social media QA.
What would settle it
Demonstrating a model that matches or exceeds human performance on the TweetQA dataset using standard techniques would undermine the claim that social media text presents distinct difficulties.
Original abstract
With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous datasets have concentrated on question answering (QA) for formal text like news and Wikipedia, we present the first large-scale dataset for QA over social media data. To ensure that the tweets we collected are useful, we only gather tweets used by journalists to write news articles. We then ask human annotators to write questions and answers upon these tweets. Unlike other QA datasets like SQuAD in which the answers are extractive, we allow the answers to be abstractive. We show that two recently proposed neural models that perform well on formal texts are limited in their performance when applied to our dataset. In addition, even the fine-tuned BERT model is still lagging behind human performance with a large margin. Our results thus point to the need of improved QA systems targeting social media text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TWEETQA as the first large-scale QA dataset for social media, constructed by collecting tweets cited by journalists in news articles and having annotators generate questions with abstractive answers. It evaluates neural QA models and BERT on the dataset, finding substantial gaps relative to human performance, and argues for the development of improved systems targeting social media text.
Significance. Should the dataset prove representative of social media QA challenges, the work would be significant as it introduces a new benchmark in an important but under-served domain of informal, real-time text. The reported model-human performance gap provides concrete evidence of current limitations and could motivate targeted research. The provision of a dataset with abstractive answers over tweets is a notable contribution compared to extractive QA datasets like SQuAD.
major comments (1)
- [Abstract] The collection method restricts to tweets used by journalists to write news articles. This curation step preferentially selects for coherent and factual tweets, which may not represent the full distribution of social media content including noisy or opinion-based posts. Since the central claim is that this enables QA 'over social media data,' this representativeness assumption is load-bearing and requires explicit discussion or validation in the manuscript.
minor comments (1)
- [Abstract] Typo: 'pop-ular' should read 'popular'. Typo: 'eventsare' should read 'events are'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on dataset construction and representativeness. We address the major comment below and will revise the manuscript to incorporate an explicit discussion of the curation approach and its implications.
Point-by-point responses
- Referee: [Abstract] The collection method restricts to tweets used by journalists to write news articles. This curation step preferentially selects for coherent and factual tweets, which may not represent the full distribution of social media content including noisy or opinion-based posts. Since the central claim is that this enables QA 'over social media data,' this representativeness assumption is load-bearing and requires explicit discussion or validation in the manuscript.
  Authors: We agree that restricting to tweets cited by journalists introduces a curation bias toward more coherent and factual content, as opposed to the full range of noisy or opinion-based social media posts. This step was deliberate to ensure the collected tweets contain substantive information suitable for QA, as noted in the abstract and methods. The tweets nonetheless originate from Twitter and exhibit social-media-specific traits including informal language, abbreviations, and real-time context. In the revised manuscript we will add a dedicated limitations subsection that explicitly discusses the curation rationale, acknowledges the resulting deviation from the broader social-media distribution, and clarifies the scope of our central claim. Full empirical validation against the entire Twitter distribution is not feasible within the scope of this work due to the scale and ephemerality of social media data, but the added discussion will make the assumptions transparent.
  Revision: yes
Circularity Check
No significant circularity; empirical dataset release with no derivation chain
full rationale
The paper introduces a new QA dataset collected from journalist-cited tweets and benchmarks models against it. There are no mathematical derivations, predictions from fitted parameters, or self-citation chains that reduce claims to inputs by construction. The central contributions are the dataset itself and empirical performance comparisons, which are self-contained against external benchmarks like SQuAD and human performance.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tweets used by journalists to write news articles are useful and representative for building a social media QA dataset.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we only gather tweets used by journalists to write news articles... first large-scale dataset for QA over social media data"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "even the fine-tuned BERT model is still lagging behind human performance with a large margin"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
  Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.