
arxiv: 2604.02951 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

How Annotation Trains Annotators: Competence Development in Social Influence Recognition

Pith reviewed 2026-05-13 19:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: annotation, competence development, social influence recognition, data quality, LLM performance, expert annotators, dialogue labeling

The pith

Annotating dialogues for social influence techniques improves annotators' own competence and label quality over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether performing annotation on subjective social influence tasks acts as training for the annotators themselves. Annotators labeled 1,021 dialogues with 20 techniques plus intentions, reactions, and consequences; an initial 150-text subset was labeled both before and after the main work to track change. After the main annotation round, self-reported competence and confidence rose, and interviews, label-quality metrics, and downstream LLM performance point in the same direction, with larger gains among experts. This matters because it turns annotation from a one-way data extraction step into a bidirectional process that shapes both the humans and the models trained on their output.

Core claim

Engaging in the annotation of 1,021 dialogues with 20 social influence techniques produces measurable competence gains: annotators return to the same 150-text comparison set with higher self-reported confidence, produce higher-quality labels, and generate data that trains LLMs to higher performance, with the competence lift more pronounced in expert groups than in non-experts.

What carries the argument

The before-and-after comparison on a fixed 150-text subset, paired with self-assessment surveys, semi-structured interviews, and LLM training/evaluation on the resulting labels.

If this is right

  • Annotation functions as an active competence-building activity rather than passive data collection.
  • Expert annotators exhibit larger competence gains than non-experts from the same process.
  • LLM performance on social-influence tasks depends on the annotators' competence level at the time of labeling.
  • Shifts in annotator judgments during a project visibly affect the quality of models trained on the collected data.
  • Repeated annotation passes on the same material can capture and exploit this training effect.
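
The expert versus non-expert contrast in the list above amounts to a difference in pre-to-post gains between two groups. A minimal sketch of how such a gap could be checked with a two-group permutation test follows. The function name, the gain values, and the choice of test are assumptions made for illustration; the paper's own analysis is not shown on this page.

```python
import random
from statistics import mean

def gain_gap_p_value(expert_gains, novice_gains, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean pre-to-post gain."""
    observed = mean(expert_gains) - mean(novice_gains)
    pooled = list(expert_gains) + list(novice_gains)
    k = len(expert_gains)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, group membership is exchangeable, so reassign
        # annotators to the two groups at random.
        rng.shuffle(pooled)
        gap = mean(pooled[:k]) - mean(pooled[k:])
        if abs(gap) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)

# Hypothetical per-annotator gains (post minus pre), invented for illustration.
expert_gains = [0.08, 0.05, 0.07, 0.06]
novice_gains = [0.02, 0.03, 0.01, 0.04]
gap, p_value = gain_gap_p_value(expert_gains, novice_gains)
```

With only a handful of annotators per group the test has little power, so a non-significant result would not show that the groups gain equally.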

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dataset pipelines could deliberately insert repeated annotation rounds on held-out subsets to harvest the competence gains.
  • One-shot annotation may systematically underestimate the final competence level that annotators would reach after full exposure.
  • The same training-through-annotation dynamic is likely to appear in other subjective labeling domains such as emotion or persuasion detection.

Load-bearing premise

Changes in label quality, self-reports, and LLM performance reflect genuine competence growth rather than task familiarity, fatigue, or other confounds in the before-after design.

What would settle it

A control group that annotates the 150-text comparison subset twice with no intervening main annotation task shows no significant rise in label quality or self-reported competence.
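
One way to read "no significant rise" for such a control group is a one-sided paired test on each annotator's pre and post scores. The sketch below uses a sign-flip permutation test; the function name, the scores, and the test itself are illustrative assumptions, not something the paper reports.

```python
import random
from statistics import mean

def sign_flip_p_value(pre, post, n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test for mean(post - pre) > 0."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [b - a for a, b in zip(pre, post)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null of no change, each paired difference is equally
        # likely to be positive or negative, so flip signs at random.
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)

# Hypothetical per-annotator quality scores, invented for illustration only.
control_pre = [0.50, 0.55, 0.48, 0.60, 0.52]
control_post = [0.51, 0.54, 0.49, 0.59, 0.53]
mean_change, p_value = sign_flip_p_value(control_pre, control_post)
```

A large p-value in the control arm alongside a small one in the main study would support the competence reading; similar rises in both arms would point to familiarity with the 150 texts instead.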

Figures

Figures reproduced from arXiv: 2604.02951 by Aleksander Szczęsny, Aleksandra Sawczuk, Beata Bajcar, Grzegorz Chodak, Jagoda Szklarczyk, Jolanta Babiak, Karolina Ostrowska, Maciej Markiewicz, Przemysław Kazienko, Tomasz Adamczyk, Wiktoria Mieleszczenko-Kowszewicz.

Figure 1
Figure 1: Overview of the studied annotation process with competence shift analyses. 4 Quantitative analysis methodology We assess annotator competence through data quality, defining increased competence as higher-quality work or equivalent quality in less time. We analyze quantitative differences in the identification of social influence between annotation rounds. Changes include technique labels, the number of "…
Figure 2
Figure 2: Shifts in the number of individual RICs by thematic categories. For reactions to social influence, a similar change in the balance of the distribution of thematic categories may be observed, with an increase in the least common categories. Language analyses support this, with a more common use of casual language (154 changes from formal to casual, 74 vice versa), which may be associated with a higher numb…
Figure 3
Figure 3: Annotation time change over the course of the process. 6.3 Annotator agreement For technique classification, inter-annotator agreement (Krippendorff’s α with Jaccard distance) improved slightly from α = 0.319 to α = 0.327, though this difference was not statistically significant. Intra-annotator agreement, measured as the pairwise α between each annotator’s Pre and Post labels, was higher than inter-annota…
Figure 4
Figure 4: Comparison of mean scores of NASA-TLX dimensions between measurements. 6.7 Annotators’ perception of competence development and confidence
Figure 5
Figure 5: Box Plots of Perceived Competence and Self-confidence across measurements. … technique detection, increased awareness, and the ability to name specific mechanisms (each 19.5%). Less common were awareness of consequences (7.3%) and knowledge of assertive responses (2.4%). These answers are reflected and elaborated upon in the analysis of extended interviews: Changes in the interpretation of texts were described…
Original abstract

Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators' judgments may evolve over time. This study investigates changes in the quality of annotators' work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators' self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that annotating 1,021 dialogues for 20 social influence techniques (plus intentions, reactions, and consequences) produces measurable competence development in 25 annotators drawn from expert and non-expert groups. Evidence is drawn from a before-after comparison on an initial 150-text subset, combined with self-assessment surveys, semi-structured interviews, quantitative label-quality metrics, and downstream LLM training/evaluation on the comparison set. The central results are reported increases in self-perceived competence and confidence, improvements in data quality that are stronger for experts, and visible gains in LLM performance when models are trained on the post-annotation data.

Significance. If the competence-development claim survives rigorous controls, the work would usefully demonstrate that annotation is not merely a data-collection step but an active learning process whose effects can be measured in label consistency, annotator self-reports, and downstream model quality. This would inform best practices for expert annotation pipelines and for using annotation tasks as implicit training mechanisms in subjective NLP domains.

major comments (3)
  1. [Methods] The before-after design on the 150-text comparison set (described in the Methods section on annotation procedure) contains no control arm, order counterbalancing, or isolation of learning mechanisms. Consequently, any observed shifts in agreement, confidence scores, or LLM F1 could arise from simple task familiarity, reduced initial uncertainty, or practice curves rather than genuine competence development. This directly weakens the stronger claims that the effect is 'more pronounced in expert groups' and that it 'has a visible impact' on trained models.
  2. [Results] The differential effect on expert versus non-expert groups is asserted without reported statistical interaction tests, subgroup effect sizes, or error bars on the quality and LLM metrics. The Results section therefore does not yet establish that the competence gains are reliably larger for experts.
  3. [Results] The LLM evaluation (final subsection of Results) reports performance differences on the comparison set but provides no baseline models trained on independent data, no details on training hyperparameters or data splits, and no ablation that isolates annotator-competence changes from other dataset properties. It is therefore unclear whether the reported 'visible impact' is attributable to the claimed competence shifts.
minor comments (2)
  1. [Abstract] The abstract states that results indicate a 'significant increase' in self-perceived competence; the corresponding Results section should report exact statistical tests, p-values, and effect sizes for all self-report and quality metrics.
  2. [Introduction] Clarify the exact operational definitions and inter-annotator agreement baselines for the 20 social-influence categories at the first mention in the Introduction or Methods.
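
For readers who want to reproduce an agreement baseline of the kind requested above, the following is a minimal sketch of Krippendorff's α for multi-label annotations. It assumes the unsquared Jaccard distance between label sets serves as the disagreement function and that α is computed as one minus observed over expected disagreement. The technique names and data are hypothetical, and the paper's exact implementation may differ.

```python
from itertools import permutations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty label sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def krippendorff_alpha(units, distance=jaccard_distance):
    """units: list of lists; each inner list holds the label sets that
    different annotators assigned to one text."""
    # Only texts with at least two annotations are pairable.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable annotations")
    # Observed disagreement: ordered pairs of annotations within each text.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: ordered pairs across all pairable annotations.
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("no variation in labels; alpha is undefined")
    return 1.0 - d_o / d_e

# Hypothetical technique labels for three texts, invented for illustration.
pre_round = [
    [{"flattery"}, {"flattery", "guilt"}],
    [{"threat"}, {"threat"}, {"guilt"}],
    [{"scarcity"}, {"scarcity"}],
]
alpha = krippendorff_alpha(pre_round)
```

Running the same function on the pre and post labels of the 150-text subset would give the two values to compare; whether their difference is meaningful still needs a separate significance test.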

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important methodological and reporting issues that we will address to strengthen the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Methods] The before-after design on the 150-text comparison set (described in the Methods section on annotation procedure) contains no control arm, order counterbalancing, or isolation of learning mechanisms. Consequently, any observed shifts in agreement, confidence scores, or LLM F1 could arise from simple task familiarity, reduced initial uncertainty, or practice curves rather than genuine competence development. This directly weakens the stronger claims that the effect is 'more pronounced in expert groups' and that it 'has a visible impact' on trained models.

    Authors: We agree that the within-subjects before-after design on the 150-text set lacks a separate control arm and counterbalancing, which prevents full isolation of competence development from practice or familiarity effects. Our interpretation relies on convergent evidence from self-reports, interviews, label-consistency metrics, and LLM performance rather than a single causal claim. In the revision we will add an explicit limitations paragraph that discusses these confounds, replace stronger causal phrasing with 'associated with' or 'suggests possible competence development,' and note that a controlled follow-up study would be needed to confirm mechanisms. revision: partial

  2. Referee: [Results] The differential effect on expert versus non-expert groups is asserted without reported statistical interaction tests, subgroup effect sizes, or error bars on the quality and LLM metrics. The Results section therefore does not yet establish that the competence gains are reliably larger for experts.

    Authors: We accept that the current Results section does not include formal tests for group-by-time interactions. In the revised version we will add mixed-effects models (or repeated-measures ANOVA) with group (expert/non-expert) × time (pre/post) interaction terms, report effect sizes, and include error bars or confidence intervals on all quality and LLM performance figures and tables. revision: yes

  3. Referee: [Results] The LLM evaluation (final subsection of Results) reports performance differences on the comparison set but provides no baseline models trained on independent data, no details on training hyperparameters or data splits, and no ablation that isolates annotator-competence changes from other dataset properties. It is therefore unclear whether the reported 'visible impact' is attributable to the claimed competence shifts.

    Authors: We will expand the LLM subsection with full details on hyperparameters, data splits, and training protocol. The core comparison is between models trained on the same 150 texts labeled before versus after the main annotation round by the same annotators; this design directly links performance change to shifts in the annotators' output. We did not include fully independent external baselines because the study focus was the within-annotator change rather than absolute performance. In revision we will clarify this rationale, add a short discussion of the limitation, and include a simple ablation (e.g., performance on label-shuffled versions of the post-annotation data). revision: partial
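
The label-shuffle ablation mentioned in the last response could be set up roughly as follows. This is a sketch under assumptions: the `(text, labels)` pair format, the function name, and the example data are hypothetical, and the training and scoring step is left out because the page gives no details about it.

```python
import random

def shuffle_labels(examples, seed=0):
    """Return a copy of (text, labels) pairs with labels permuted across texts.

    The label distribution is preserved, but the link between each text
    and its original labels is broken.
    """
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    rng = random.Random(seed)
    rng.shuffle(labels)
    return list(zip(texts, labels))

# Hypothetical post-round annotations, invented for illustration only.
post_round = [
    ("dialogue 1", ("flattery",)),
    ("dialogue 2", ("threat", "guilt")),
    ("dialogue 3", ("scarcity",)),
    ("dialogue 4", ()),
]
shuffled = shuffle_labels(post_round, seed=1)
```

If a model trained on `shuffled` scores about as well as one trained on `post_round`, the reported gains cannot come from the annotators' labeling decisions.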

Circularity Check

0 steps flagged

No circularity: empirical before-after measurements are independent of inputs

full rationale

The paper is a purely empirical study relying on primary data collection from 25 annotators labeling 1,021 dialogues, with before-after comparisons on a 150-text subset, plus surveys, interviews, and downstream LLM training. No equations, fitted parameters, self-definitional constructs, or derivations appear in the provided text or abstract. Central claims rest on observed shifts in label quality, self-reported competence, and model performance metrics, which are measured outcomes rather than quantities defined in terms of themselves. Any self-citations (none load-bearing in the excerpt) would not reduce the results to inputs by construction. Methodological concerns about confounds exist but fall outside circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical observational study with no free parameters, axioms, or invented entities; relies on standard social science measurement assumptions.


