TWEETQA: A Social Media Focused Question Answering Dataset

Hong Wang; Jiawei Wu; Mo Yu; Shiyu Chang; Vivek Kulkarni; Wenhan Xiong; William Yang Wang; Xiaoxiao Guo

arxiv: 1907.06292 · v1 · pith:FFRNT6FQnew · submitted 2019-07-14 · 💻 cs.CL

Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, William Yang Wang

Pith reviewed 2026-05-24 21:22 UTC · model grok-4.3

keywords question answering · social media · TweetQA · abstractive QA · BERT · neural models · dataset

The pith

TweetQA is the first large-scale dataset for question answering over tweets, revealing that even fine-tuned BERT lags human performance significantly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a dataset of questions and abstractive answers from tweets that journalists used in news articles. They test neural models known to work on formal text and find they perform much worse on this social media data. This matters because many real-time events and news are reported on social media, so effective QA systems there could support better applications. The large gap with human performance suggests current methods need adaptation for informal text.

Core claim

We present the first large-scale dataset for QA over social media data by collecting tweets used by journalists to write news articles and having annotators create questions and abstractive answers on them. Two recently proposed neural models that perform well on formal text perform markedly worse on this dataset, and even fine-tuned BERT lags behind human performance by a large margin.

What carries the argument

The TweetQA dataset, built from journalist-sourced tweets with abstractive QA pairs, used as a benchmark to demonstrate limitations of existing QA models on social media text.

If this is right

QA systems for real-time knowledge from social media will require new approaches beyond those for news and Wikipedia.
Models must handle abstractive answers rather than just extractive spans.
The dataset provides a testbed to develop and evaluate social media specific QA techniques.
Performance gaps indicate that informal language and noise in tweets pose unique challenges.
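
The distinction between extractive and abstractive answers can be made concrete with a small check. The following is a minimal sketch, not the authors' tooling: it assumes a crude regex tokenizer, and the helper names `tokenize` and `is_extractive` as well as the example tweet are hypothetical. An answer counts as extractive here only if its tokens appear as one contiguous span of the tweet.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_extractive(tweet, answer):
    """Return True if the answer's tokens occur as a contiguous span of the tweet."""
    t, a = tokenize(tweet), tokenize(answer)
    if not a:
        return False
    # Slide a window of len(a) tokens over the tweet and compare.
    return any(t[i:i + len(a)] == a for i in range(len(t) - len(a) + 1))

# Hypothetical example, not taken from the dataset.
tweet = "Our flight to Denver got cancelled again, thanks a lot @airline"
print(is_extractive(tweet, "Denver"))                    # copied span
print(is_extractive(tweet, "the airline cancelled it"))  # paraphrase
```

A span-selection model can only ever produce answers of the first kind, which is why a dataset that permits the second kind calls for generation-capable models or a different answer formulation.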

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the dataset to other social media platforms could reveal if Twitter-specific features drive the difficulty.
Training models with more social media data might close the performance gap with humans.
Applications like automated news summarization or event detection could use such QA systems if improved.

Load-bearing premise

That tweets selected because journalists used them to write news articles form a representative and useful sample for general social media QA.

What would settle it

Demonstrating a model that matches or exceeds human performance on the TweetQA dataset using standard techniques would undermine the claim that social media text presents distinct difficulties.
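
One way to picture such a comparison is to score model answers and human answers against the same reference answers and look at the difference. The sketch below is illustrative only: this page does not state which metrics the paper reports, so token-level F1 is an assumption, and the function names and toy answers are hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    p = prediction.lower().split()
    r = reference.lower().split()
    if not p or not r:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mean_score(predictions, references):
    """Average token F1 over aligned prediction/reference pairs."""
    return sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references)

# Toy data (hypothetical, not from TweetQA).
references = ["the airline", "in denver"]
human = ["the airline", "denver"]
model = ["a flight", "in denver"]

gap = mean_score(human, references) - mean_score(model, references)
print(f"human-model gap: {gap:.3f}")
```

Under this framing, the claim would be undermined if the gap shrank to zero or turned negative on the held-out test set for a model built with standard techniques.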

Original abstract

With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous datasets have concentrated on question answering (QA) for formal text like news and Wikipedia, we present the first large-scale dataset for QA over social media data. To ensure that the tweets we collected are useful, we only gather tweets used by journalists to write news articles. We then ask human annotators to write questions and answers upon these tweets. Unlike other QA datasets like SQuAD in which the answers are extractive, we allow the answers to be abstractive. We show that two recently proposed neural models that perform well on formal texts are limited in their performance when applied to our dataset. In addition, even the fine-tuned BERT model is still lagging behind human performance with a large margin. Our results thus point to the need of improved QA systems targeting social media text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents TWEETQA as the first large-scale QA dataset for social media, constructed by collecting tweets cited by journalists in news articles and having annotators generate questions with abstractive answers. It evaluates neural QA models and BERT on the dataset, finding substantial gaps relative to human performance, and argues for the development of improved systems targeting social media text.

Significance. Should the dataset prove representative of social media QA challenges, the work would be significant as it introduces a new benchmark in an important but under-served domain of informal, real-time text. The reported model-human performance gap provides concrete evidence of current limitations and could motivate targeted research. The provision of a dataset with abstractive answers over tweets is a notable contribution compared to extractive QA datasets like SQuAD.

major comments (1)

[Abstract] The collection method restricts to tweets used by journalists to write news articles. This curation step preferentially selects for coherent and factual tweets, which may not represent the full distribution of social media content including noisy or opinion-based posts. Since the central claim is that this enables QA 'over social media data,' this representativeness assumption is load-bearing and requires explicit discussion or validation in the manuscript.

minor comments (1)

[Abstract] Typo: 'pop-ular' should read 'popular'. Typo: 'eventsare' should read 'events are'.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on dataset construction and representativeness. We address the major comment below and will revise the manuscript to incorporate an explicit discussion of the curation approach and its implications.

Referee: [Abstract] The collection method restricts to tweets used by journalists to write news articles. This curation step preferentially selects for coherent and factual tweets, which may not represent the full distribution of social media content including noisy or opinion-based posts. Since the central claim is that this enables QA 'over social media data,' this representativeness assumption is load-bearing and requires explicit discussion or validation in the manuscript.

Authors: We agree that restricting to tweets cited by journalists introduces a curation bias toward more coherent and factual content, as opposed to the full range of noisy or opinion-based social media posts. This step was deliberate to ensure the collected tweets contain substantive information suitable for QA, as noted in the abstract and methods. The tweets nonetheless originate from Twitter and exhibit social-media-specific traits including informal language, abbreviations, and real-time context. In the revised manuscript we will add a dedicated limitations subsection that explicitly discusses the curation rationale, acknowledges the resulting deviation from the broader social-media distribution, and clarifies the scope of our central claim. Full empirical validation against the entire Twitter distribution is not feasible within the scope of this work due to the scale and ephemerality of social media data, but the added discussion will make the assumptions transparent.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical dataset release with no derivation chain

The paper introduces a new QA dataset collected from journalist-cited tweets and benchmarks models against it. There are no mathematical derivations, predictions from fitted parameters, or self-citation chains that reduce claims to inputs by construction. The central contributions are the dataset itself and empirical performance comparisons, which are self-contained against external benchmarks like SQuAD and human performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that journalist-cited tweets are a suitable proxy for social media text; no free parameters, invented entities, or additional axioms are introduced.

axioms (1)

domain assumption: Tweets used by journalists to write news articles are useful and representative for building a social media QA dataset.
Explicit selection criterion stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we only gather tweets used by journalists to write news articles... first large-scale dataset for QA over social media data"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "even the fine-tuned BERT model is still lagging behind human performance with a large margin"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
cs.CL · 2026-04 · unverdicted · novelty 6.0

Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.