When Annotators Agree but Labels Disagree: The Projection Problem in Stance Detection
Pith reviewed 2026-05-15 00:29 UTC · model grok-4.3
The pith
Annotators agree more when judging the separate dimensions of an attitude than when compressing them into a single stance label.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stance detection labels require projecting multi-dimensional attitudes into a single favor-against-neutral category. When the same annotators provide both the overall label and target-specific dimension judgments, agreement on the dimensions is consistently higher than agreement on the compressed labels. The gap appears to scale with target complexity: modest for single-entity targets, large for multi-faceted policy targets.
What carries the argument
The projection problem: the forced compression of a multi-dimensional attitude into a single unitary stance label, which produces disagreement when annotators weight the dimensions differently.
If this is right
- Stance detection datasets contain systematic disagreement traceable to projection choices rather than to text ambiguity alone.
- Dimensional annotation yields higher inter-annotator reliability than single-label annotation for complex targets.
- Reported performance on stance benchmarks partly reflects how well models learn common compression strategies.
- Disagreement metrics in stance work should be decomposed into projection variance versus other sources of noise.
- For policy-oriented targets, single-label tasks may impose a lower performance ceiling than dimensional tasks.
Where Pith is reading between the lines
- Models could output per-dimension predictions and apply a controllable aggregation step instead of learning a single compressed label (see the sketch after this list).
- The same compression issue likely affects other opinion tasks such as aspect-based sentiment analysis when targets have internal structure.
- Allowing annotators to note which dimensions drove their label might reduce apparent disagreement without changing the label set.
- The gap's dependence on target complexity suggests benchmarks should stratify targets by dimensionality to measure the effect reliably.
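The first point above is concrete enough to sketch. The following is a minimal, hypothetical illustration of a controllable aggregation step, assuming a model that already produces per-dimension stance scores in [-1, 1]; the dimension names, weights, and neutral band are illustrative choices, not taken from the paper.

```python
# Hypothetical sketch, not the paper's method: compress per-dimension stance
# scores into a single Favor/Against/Neutral label with an explicit,
# controllable weighting instead of a learned compression.
def aggregate_stance(dim_scores, weights, neutral_band=0.15):
    """dim_scores: dimension -> score in [-1, 1] (negative = against).
    weights: dimension -> non-negative importance, chosen by the user."""
    total = sum(weights[d] * s for d, s in dim_scores.items())
    norm = total / sum(weights[d] for d in dim_scores)
    if norm > neutral_band:
        return "Favor"
    if norm < -neutral_band:
        return "Against"
    return "Neutral"

# Illustrative school-closure example with hypothetical dimensions:
# support on public health, opposition on economic impact.
scores = {"public_health": 0.8, "economic_impact": -0.6, "civil_liberties": -0.1}
weights = {"public_health": 1.0, "economic_impact": 1.0, "civil_liberties": 0.5}
print(aggregate_stance(scores, weights))  # -> "Neutral" under these weights
```

Because the weights and neutral band are explicit parameters rather than learned implicitly, different projection choices can be compared directly or exposed to downstream users.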
Load-bearing premise
The bottom-up dimensions for each target represent the attitudes being labeled completely and without overlap, and annotators' dimension judgments are independent of their overall label choices.
What would settle it
A replication using the same targets but newly elicited dimensions, or a different annotator group, in which label agreement equals or exceeds dimensional agreement would challenge the claim.
read the original abstract
Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral. This convention was inherited from debate analysis and has been applied without modification to social media since SemEval-2016. However, attitudes toward complex targets are not unitary. A person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators may weight different dimensions, producing disagreement that reflects different compression choices rather than confusion. We call this the projection problem. We conduct an annotation study across five targets from three stance benchmarks (SemEval-2016, P-Stance, COVID-19-Stance), with the same three annotators labeling all targets. For each target, annotators assign both a standard stance label and per-dimension judgments along target-specific dimensions discovered through bottom-up analysis, using the same number of categories for both. Across all fifteen target--dimension pairs, dimensional agreement consistently exceeds label agreement. The gap appears to scale with target complexity: modest for a single-entity target like Joe Biden (AC1: 0.87 vs. 0.95), but large for a multi-faceted policy target like school closures (AC1: 0.21 vs. 0.71).
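A note on the metric, separate from the abstract: the agreement figures above are Gwet's AC1 values. As a reference, here is a minimal sketch of the standard multi-rater AC1 computation; the toy labels are invented for illustration and are not data from the study.

```python
# Minimal sketch of Gwet's AC1 for m raters and q categories (standard
# formulation, not code from the paper). ratings[i] holds the categories the
# raters assigned to item i; the toy data below is invented for illustration.
from collections import Counter

def gwet_ac1(ratings, categories):
    q = len(categories)
    m = len(ratings[0])            # raters per item
    n = len(ratings)               # items
    pa_sum = 0.0
    pi = {k: 0.0 for k in categories}
    for item in ratings:
        counts = Counter(item)
        # share of rater pairs on this item that chose the same category
        pa_sum += sum(c * (c - 1) for c in counts.values()) / (m * (m - 1))
        for k in categories:
            pi[k] += counts.get(k, 0) / m
    pa = pa_sum / n                                                # observed agreement
    pe = sum((v / n) * (1 - v / n) for v in pi.values()) / (q - 1)  # chance agreement
    return (pa - pe) / (1 - pe)

# 3 annotators, 3-way labels: F(avor), A(gainst), N(eutral)
labels = [("F", "F", "A"), ("N", "N", "N"), ("A", "A", "A"), ("F", "A", "N")]
print(round(gwet_ac1(labels, ["F", "A", "N"]), 3))  # ≈ 0.381 for this toy data
```

AC1 is generally more robust than kappa-style coefficients when category prevalences are skewed, which may matter for targets that elicit heavily one-sided labels.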
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'projection problem' in stance detection, arguing that compressing multi-dimensional attitudes toward complex targets into single Favor/Against/Neutral labels causes annotator disagreement due to differing dimension weightings. It reports an annotation study using the same three annotators across five targets from SemEval-2016, P-Stance, and COVID-19-Stance benchmarks. For each target, annotators provide both standard labels and ratings on bottom-up discovered target-specific dimensions (with matched category counts). The central empirical result is that dimensional agreement (AC1) exceeds label agreement across all 15 target-dimension pairs, with the gap appearing larger for complex targets (e.g., 0.21 vs. 0.71 for school closures) than simple ones (e.g., 0.87 vs. 0.95 for Joe Biden).
Significance. If the result holds after addressing design concerns, the work identifies a previously under-examined source of label noise in stance datasets and demonstrates that explicit dimensional annotations can yield higher inter-annotator agreement. This could improve the reliability of training data for stance detection models in NLP, particularly for policy or multi-faceted targets, and encourage shifts away from the inherited three-way label convention. The cross-benchmark empirical comparison is a concrete strength.
major comments (2)
- [§3] §3 (Annotation Study): The procedure uses the identical three annotators to produce both the standard labels and the per-dimension ratings after bottom-up dimension discovery by those same annotators. This design does not include controls (e.g., separate annotator pools or blinded ordering) to establish independence between dimensional judgments and prior label choices, so the reported AC1 gap could partly reflect within-annotator consistency rather than evidence that dimensions solve the projection problem.
- [Results] Results section: The claim that the agreement gap 'appears to scale with target complexity' is presented without a statistical test or formal comparison (e.g., correlation between gap size and a pre-defined complexity metric across the five targets). With only five targets, the pattern (modest gap for Joe Biden, large gap for school closures) remains descriptive and does not yet support the scaling generalization.
minor comments (1)
- [Abstract] The abstract states that dimensions were 'discovered through bottom-up analysis' but does not specify the exact protocol (e.g., how many dimensions per target, inter-annotator agreement during discovery, or whether dimensions were required to be non-overlapping).
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address the two major comments below by clarifying the study design rationale, acknowledging limitations, and committing to revisions that strengthen the presentation without overstating the results.
read point-by-point responses
- Referee: [§3] §3 (Annotation Study): The procedure uses the identical three annotators to produce both the standard labels and the per-dimension ratings after bottom-up dimension discovery by those same annotators. This design does not include controls (e.g., separate annotator pools or blinded ordering) to establish independence between dimensional judgments and prior label choices, so the reported AC1 gap could partly reflect within-annotator consistency rather than evidence that dimensions solve the projection problem.
Authors: We appreciate the referee's observation on this design choice. The same three annotators were used deliberately to enable a controlled within-subject comparison of label versus dimensional agreement, thereby isolating the effect of the projection problem from inter-annotator differences. Dimensions were elicited in a separate bottom-up phase before any labels were assigned, and annotators were explicitly instructed to rate each dimension independently based on the tweet content. Nevertheless, we acknowledge that this setup cannot fully rule out within-annotator consistency effects. We will revise §3 to state this limitation explicitly and to recommend that future validation studies employ independent annotator pools and blinded procedures. revision: yes
- Referee: [Results] Results section: The claim that the agreement gap 'appears to scale with target complexity' is presented without a statistical test or formal comparison (e.g., correlation between gap size and a pre-defined complexity metric across the five targets). With only five targets, the pattern (modest gap for Joe Biden, large gap for school closures) remains descriptive and does not yet support the scaling generalization.
Authors: We agree that the observed pattern is descriptive. With only five targets, any formal statistical test (e.g., correlation with a complexity metric) would have insufficient power to support generalization. We will revise the Results section to present the gaps as an observed trend in the current data, remove any implication of a confirmed scaling relationship, and add a sentence in the Discussion noting that larger-scale studies would be required to test this hypothesis formally. revision: yes
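To make the formal comparison discussed above concrete, here is a minimal sketch of the referee's suggested test. The complexity metric (number of bottom-up dimensions per target) is one hypothetical choice; only the Joe Biden and school-closures AC1 values come from the abstract, and the remaining rows are placeholders to be filled from the paper's results. As the authors note, with five targets any such test is severely underpowered.

```python
# Hedged sketch of the suggested gap-vs-complexity test (not from the paper).
# Only the Joe Biden and school-closures AC1 values come from the abstract;
# the other rows and all dimension counts are placeholders.
from scipy.stats import spearmanr

# target: (n_dimensions, label_AC1, dimensional_AC1)
targets = {
    "joe_biden":       (2, 0.87, 0.95),  # from the abstract
    "school_closures": (4, 0.21, 0.71),  # from the abstract
    "target_3":        (3, 0.50, 0.70),  # placeholder
    "target_4":        (3, 0.55, 0.72),  # placeholder
    "target_5":        (2, 0.80, 0.88),  # placeholder
}

complexity = [v[0] for v in targets.values()]
gaps = [dim_ac1 - lab_ac1 for _, lab_ac1, dim_ac1 in targets.values()]
rho, p = spearmanr(complexity, gaps)
# With n = 5 targets the p-value carries little weight; the point is only to
# show the shape of the comparison the referee asks for.
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```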
Circularity Check
No circularity: direct empirical comparison of agreement metrics
full rationale
The paper reports results from a fresh annotation study in which three annotators assign both standard stance labels and per-dimension ratings for the same texts. The central finding (dimensional AC1 consistently higher than label AC1 across fifteen target-dimension pairs) is obtained by direct computation of agreement statistics on the collected judgments. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked that would reduce this observed difference to the inputs by construction. The design choices (bottom-up dimension discovery, matched category counts) are methodological assumptions whose validity can be evaluated externally; they do not create a self-referential derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] AC1 is an appropriate agreement metric for the categorical judgments collected.
- [domain assumption] The bottom-up dimensions discovered for each target capture the main axes of attitude variation without significant omission or overlap.
invented entities (1)
- projection problem: no independent evidence