How Annotation Trains Annotators: Competence Development in Social Influence Recognition

arxiv: 2604.02951 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Maciej Markiewicz, Beata Bajcar, Wiktoria Mieleszczenko-Kowszewicz, Aleksander Szczęsny, Tomasz Adamczyk, Grzegorz Chodak, Karolina Ostrowska, Aleksandra Sawczuk, Jolanta Babiak, Jagoda Szklarczyk, Przemysław Kazienko

Pith reviewed 2026-05-13 19:45 UTC · model grok-4.3

keywords: annotation · competence development · social influence recognition · data quality · LLM performance · expert annotators · dialogue labeling

The pith

Annotating dialogues for social influence techniques improves annotators' own competence and label quality over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether performing annotation on subjective social influence tasks acts as training for the annotators themselves. Annotators labeled 1,021 dialogues with 20 techniques plus intentions, reactions, and consequences; an initial 150-text subset was labeled both before and after the main work to track change. Self-reports, interviews, label-quality metrics, and downstream LLM performance all rose after the main annotation round, with larger gains among experts. A sympathetic reader cares because this turns annotation from a one-way data extraction step into a bidirectional process that shapes both the humans and the models trained on their output.

Core claim

Engaging in the annotation of 1,021 dialogues with 20 social influence techniques produces measurable competence gains: annotators return to the same 150-text comparison set with higher self-reported confidence, produce higher-quality labels, and generate data that trains LLMs to higher performance, with the competence lift more pronounced in expert groups than in non-experts.

What carries the argument

The before-and-after comparison on a fixed 150-text subset, paired with self-assessment surveys, semi-structured interviews, and LLM training/evaluation on the resulting labels.

If this is right

Annotation functions as an active competence-building activity rather than passive data collection.
Expert annotators exhibit larger competence gains than non-experts from the same process.
LLM performance on social-influence tasks depends on the annotators' competence level at the time of labeling.
Shifts in annotator judgments during a project visibly affect the quality of models trained on the collected data.
Repeated annotation passes on the same material can capture and exploit this training effect.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dataset pipelines could deliberately insert repeated annotation rounds on held-out subsets to harvest the competence gains.
One-shot annotation may systematically underestimate the final competence level that annotators would reach after full exposure.
The same training-through-annotation dynamic is likely to appear in other subjective labeling domains such as emotion or persuasion detection.

Load-bearing premise

Changes in label quality, self-reports, and LLM performance reflect genuine competence growth rather than task familiarity, fatigue, or other confounds in the before-after design.

What would settle it

A control group that annotates the 150-text comparison subset twice with no intervening main annotation task shows no significant rise in label quality or self-reported competence.

Figures

Figures reproduced from arXiv: 2604.02951 by Aleksander Szczęsny, Aleksandra Sawczuk, Beata Bajcar, Grzegorz Chodak, Jagoda Szklarczyk, Jolanta Babiak, Karolina Ostrowska, Maciej Markiewicz, Przemysław Kazienko, Tomasz Adamczyk, Wiktoria Mieleszczenko-Kowszewicz.

**Figure 1.** Overview of the studied annotation process with competence shift analyses.

Surrounding text (Section 4, Quantitative analysis methodology): We assess annotator competence through data quality, defining increased competence as higher-quality work or equivalent quality in less time. We analyze quantitative differences in the identification of social influence between annotation rounds. Changes include technique labels, the number of "…

**Figure 2.** Shifts in the number of individual RICs by thematic categories.

Surrounding text: For reactions to social influence, a similar change in the balance of the distribution of thematic categories may be observed, with an increase in the least common categories. Language analyzes support this, with a more common use of casual language (154 changes from formal to casual, 74 vice versa), which may be associated with a higher numb…

**Figure 3.** Annotation time change over the course of the process.

Surrounding text (Section 6.3, Annotator agreement): For technique classification, inter-annotator agreement (Krippendorff’s α with Jaccard distance) improved slightly from α = 0.319 to α = 0.327, though this difference was not statistically significant. Intra-annotator agreement, measured as the pairwise α between each annotator’s Pre and Post labels, was higher than inter-annota…

**Figure 4.** Comparison of mean scores of NASA-TLX dimensions between measurements.

Surrounding text: heading of Section 6.7, Annotators’ perception of competence development and confidence.

**Figure 5.** Box Plots of Perceived Competence and Self-confidence across measurements.

Surrounding text: …nique detection, increased awareness, and the ability to name specific mechanisms (each 19.5%). Less common were awareness of consequences (7.3%) and knowledge of assertive responses (2.4%). These answers are reflected and elaborated upon in the analysis of extended interviews: Changes in the interpretation of texts were described…
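The text accompanying Figure 3 reports agreement as Krippendorff's α with Jaccard distance over sets of technique labels. As a rough illustration of how such a coefficient can be computed for multi-label annotations, here is a minimal sketch. It is not the authors' code: the function names, the data layout (one list of label sets per text), and the use of the unsquared Jaccard distance as the disagreement function are all assumptions made for this example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two label sets (0 = identical, 1 = disjoint)."""
    a, b = frozenset(a), frozenset(b)
    if not a and not b:
        return 0.0  # two empty label sets count as full agreement
    return 1.0 - len(a & b) / len(a | b)


def krippendorff_alpha(units, distance=jaccard_distance):
    """Krippendorff's alpha = 1 - D_o / D_e for set-valued labels.

    units: list of lists; each inner list holds the label sets that
    different annotators assigned to one text (hypothetical layout).
    Texts with fewer than two annotations are not pairable and are skipped.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable annotations")

    # Observed disagreement: within-text pairs, each text weighted by 1/(m_u - 1).
    d_o = 0.0
    for u in pairable:
        within = sum(distance(a, b) for a, b in combinations(u, 2))
        d_o += 2.0 * within / (len(u) - 1)
    d_o /= n

    # Expected disagreement: all pairs of annotations, regardless of text.
    across = sum(distance(a, b) for a, b in combinations(values, 2))
    d_e = 2.0 * across / (n * (n - 1))

    if d_e == 0.0:
        return float("nan")  # alpha is undefined when every annotation is identical
    return 1.0 - d_o / d_e
```

Under these assumptions, intra-annotator agreement of the kind described in the Figure 3 excerpt could be obtained by passing, for each text, the pair of label sets one annotator produced in the Pre and Post rounds.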

Original abstract

Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators' judgments may evolve over time. This study investigates changes in the quality of annotators' work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators' self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Annotation builds competence in spotting social influence techniques with some LLM payoff, but the before-after design on the same 150 texts leaves practice effects as a live alternative.

The main point is that annotators improve at recognizing the 20 social influence techniques as they work through the 1,021 dialogues, with experts showing larger gains and the improved labels producing measurable lifts in LLM performance. The paper backs this with label-quality shifts, self-reports, interviews, and direct LLM training tests on the comparison set. That multi-angle approach is a step up from pure agreement metrics and gives the claim some grounding in real outcomes rather than just theory. The mixed expert and non-expert groups also let them surface differences that single-group studies often miss.

What the work does cleanly is treat annotation as an active process that can change the annotator, not just a static data-collection step. The dataset size and the downstream LLM evaluation add practical weight for people who care about training data quality.

The soft spot sits in the design. The key comparison re-annotates the same initial 150 texts after the main task, so any rise in confidence, agreement, or model F1 could come from simple familiarity with the categories, reduced initial uncertainty, or even fatigue curves rather than genuine competence growth. No control arm or order counterbalancing is described to separate those, which weakens the stronger claims about expert-specific effects and visible LLM impact. Without error bars, exact statistical controls, or a breakdown of which techniques drove the changes, the size and mechanism of the effect stay hard to pin down.

This paper is for researchers who run annotation pipelines in NLP or social science and want to think about how the labeling process itself shapes data quality. Readers focused on subjective tasks or on closing the loop between human annotation and model training will find usable ideas here. It deserves a serious referee because the question is concrete and the methods mix is promising, even if the current evidence needs tightening on confounds.

Send it to review with requests for clearer controls and quantitative details on the before-after shifts.

Referee Report

3 major / 2 minor

Summary. The paper claims that annotating 1,021 dialogues for 20 social influence techniques (plus intentions, reactions, and consequences) produces measurable competence development in 25 annotators drawn from expert and non-expert groups. Evidence is drawn from a before-after comparison on an initial 150-text subset, combined with self-assessment surveys, semi-structured interviews, quantitative label-quality metrics, and downstream LLM training/evaluation on the comparison set. The central results are reported increases in self-perceived competence and confidence, improvements in data quality that are stronger for experts, and visible gains in LLM performance when models are trained on the post-annotation data.

Significance. If the competence-development claim survives rigorous controls, the work would usefully demonstrate that annotation is not merely a data-collection step but an active learning process whose effects can be measured in label consistency, annotator self-reports, and downstream model quality. This would inform best practices for expert annotation pipelines and for using annotation tasks as implicit training mechanisms in subjective NLP domains.

major comments (3)

[Methods] The before-after design on the 150-text comparison set (described in the Methods section on annotation procedure) contains no control arm, order counterbalancing, or isolation of learning mechanisms. Consequently, any observed shifts in agreement, confidence scores, or LLM F1 could arise from simple task familiarity, reduced initial uncertainty, or practice curves rather than genuine competence development. This directly weakens the stronger claims that the effect is 'more pronounced in expert groups' and that it 'has a visible impact' on trained models.
[Results] The differential effect on expert versus non-expert groups is asserted without reported statistical interaction tests, subgroup effect sizes, or error bars on the quality and LLM metrics. The Results section therefore does not yet establish that the competence gains are reliably larger for experts.
[Results] The LLM evaluation (final subsection of Results) reports performance differences on the comparison set but provides no baseline models trained on independent data, no details on training hyperparameters or data splits, and no ablation that isolates annotator-competence changes from other dataset properties. It is therefore unclear whether the reported 'visible impact' is attributable to the claimed competence shifts.
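The second major comment asks for an interaction test. In a two-group pre/post design, the group × time interaction can be examined by comparing per-annotator gain scores (post minus pre) between experts and non-experts. The sketch below estimates that difference and attaches a permutation p-value. It is an illustration under stated assumptions, not the analysis the paper ran: the function names are hypothetical, the inputs are assumed to be one quality score per annotator per round, and a mixed-effects model may be more appropriate when annotators contribute many items each.

```python
import random
from statistics import mean


def gains(pre, post):
    """Per-annotator change in a quality score between rounds (post - pre)."""
    if len(pre) != len(post):
        raise ValueError("pre and post must list the same annotators")
    return [b - a for a, b in zip(pre, post)]


def interaction_estimate(expert_gains, novice_gains):
    """Difference in mean gain between groups: the group x time interaction."""
    return mean(expert_gains) - mean(novice_gains)


def permutation_p_value(expert_gains, novice_gains, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in mean gains.

    Group membership is shuffled across annotators; the p-value is the share
    of shuffles whose absolute difference is at least as large as observed.
    """
    rng = random.Random(seed)
    observed = abs(interaction_estimate(expert_gains, novice_gains))
    pooled = list(expert_gains) + list(novice_gains)
    k = len(expert_gains)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

With only 25 annotators split over five groups, a permutation test avoids distributional assumptions, but its resolution is limited by the small number of distinct group assignments, so the p-value should be read alongside the effect estimate and its uncertainty.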

minor comments (2)

[Abstract] The abstract states that results indicate a 'significant increase' in self-perceived competence; the corresponding Results section should report exact statistical tests, p-values, and effect sizes for all self-report and quality metrics.
[Introduction] Clarify the exact operational definitions and inter-annotator agreement baselines for the 20 social-influence categories at the first mention in the Introduction or Methods.
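The first minor comment asks for test statistics and effect sizes on the pre/post self-report measures. For paired measurements, one common summary is Cohen's d_z (mean of the differences divided by their standard deviation) together with the paired t statistic. The following is a minimal sketch with hypothetical names. It assumes one score per annotator per measurement and treats the scores as interval data, which may not hold for Likert-type items, where a rank-based test could be preferable.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(pre, post):
    """Effect size and t statistic for paired pre/post scores.

    Returns the number of pairs, the mean difference, Cohen's d_z
    (mean difference / SD of differences), the paired t statistic
    (d_z * sqrt(n)), and its degrees of freedom.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; d_z is undefined")
    d_z = mean(diffs) / sd
    return {
        "n": n,
        "mean_diff": mean(diffs),
        "d_z": d_z,
        "t": d_z * sqrt(n),
        "df": n - 1,
    }
```

The p-value is left out on purpose: it requires the t distribution's cumulative function, which is not in the Python standard library, so a statistics package would be needed for that step.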

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important methodological and reporting issues that we will address to strengthen the manuscript. We respond to each major comment below.

Referee: [Methods] The before-after design on the 150-text comparison set (described in the Methods section on annotation procedure) contains no control arm, order counterbalancing, or isolation of learning mechanisms. Consequently, any observed shifts in agreement, confidence scores, or LLM F1 could arise from simple task familiarity, reduced initial uncertainty, or practice curves rather than genuine competence development. This directly weakens the stronger claims that the effect is 'more pronounced in expert groups' and that it 'has a visible impact' on trained models.

Authors: We agree that the within-subjects before-after design on the 150-text set lacks a separate control arm and counterbalancing, which prevents full isolation of competence development from practice or familiarity effects. Our interpretation relies on convergent evidence from self-reports, interviews, label-consistency metrics, and LLM performance rather than a single causal claim. In the revision we will add an explicit limitations paragraph that discusses these confounds, replace stronger causal phrasing with 'associated with' or 'suggests possible competence development,' and note that a controlled follow-up study would be needed to confirm mechanisms. revision: partial
Referee: [Results] The differential effect on expert versus non-expert groups is asserted without reported statistical interaction tests, subgroup effect sizes, or error bars on the quality and LLM metrics. The Results section therefore does not yet establish that the competence gains are reliably larger for experts.

Authors: We accept that the current Results section does not include formal tests for group-by-time interactions. In the revised version we will add mixed-effects models (or repeated-measures ANOVA) with group (expert/non-expert) × time (pre/post) interaction terms, report effect sizes, and include error bars or confidence intervals on all quality and LLM performance figures and tables. revision: yes
Referee: [Results] The LLM evaluation (final subsection of Results) reports performance differences on the comparison set but provides no baseline models trained on independent data, no details on training hyperparameters or data splits, and no ablation that isolates annotator-competence changes from other dataset properties. It is therefore unclear whether the reported 'visible impact' is attributable to the claimed competence shifts.

Authors: We will expand the LLM subsection with full details on hyperparameters, data splits, and training protocol. The core comparison is between models trained on the same 150 texts labeled before versus after the main annotation round by the same annotators; this design directly links performance change to shifts in the annotators' output. We did not include fully independent external baselines because the study focus was the within-annotator change rather than absolute performance. In revision we will clarify this rationale, add a short discussion of the limitation, and include a simple ablation (e.g., performance on label-shuffled versions of the post-annotation data). revision: partial
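The label-shuffled ablation mentioned in the last response can be sketched as follows: keep the texts fixed, permute which label set is attached to which text, and train on the result to obtain a floor against which the Pre and Post models are compared. This is only an illustration of the idea; the function name and the `(text, labels)` pair layout are assumptions, and the rebuttal does not specify how the shuffle would actually be implemented.

```python
import random


def shuffle_labels(examples, seed=0):
    """Return a copy of the dataset with labels permuted across texts.

    examples: list of (text, labels) pairs (hypothetical layout).
    Texts keep their order; the overall label distribution is unchanged,
    but the link between each text and its labels is broken.
    """
    rng = random.Random(seed)  # fixed seed so the ablation is reproducible
    texts = [text for text, _ in examples]
    labels = [lab for _, lab in examples]
    rng.shuffle(labels)  # shuffles the local copy, not the caller's data
    return list(zip(texts, labels))
```

Because the shuffled set preserves label frequencies, a model trained on it can still learn class priors. Any gap between that model and the ones trained on real Pre or Post labels then reflects the text-to-label mapping rather than the label distribution alone.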

Circularity Check

0 steps flagged

No circularity: empirical before-after measurements are independent of inputs

The paper is a purely empirical study relying on primary data collection from 25 annotators labeling 1,021 dialogues, with before-after comparisons on a 150-text subset, plus surveys, interviews, and downstream LLM training. No equations, fitted parameters, self-definitional constructs, or derivations appear in the provided text or abstract. Central claims rest on observed shifts in label quality, self-reported competence, and model performance metrics, which are measured outcomes rather than quantities defined in terms of themselves. Any self-citations (none load-bearing in the excerpt) would not reduce the results to inputs by construction. Methodological concerns about confounds exist but fall outside circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical observational study with no free parameters, axioms, or invented entities; relies on standard social science measurement assumptions.

