Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model
Pith reviewed 2026-05-07 04:14 UTC · model grok-4.3
The pith
Different sources of annotation for German aspect-based sentiment analysis produce datasets of varying quality that affect how well state-of-the-art models perform on fine-grained sentiment tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Re-annotating an existing German ABSA dataset with experts creates a ground truth against which annotations from students, crowdworkers, and LLMs are compared using inter-annotator agreement and downstream performance on aspect category sentiment analysis and target aspect sentiment detection. State-of-the-art models based on BERT, T5, and LLaMA are trained and evaluated under both fine-tuning and in-context learning settings to quantify how annotation source changes task accuracy. The resulting measurements show clear differences in consistency and model effectiveness that point to concrete trade-offs between annotation reliability and the effort required to obtain it.
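To make the downstream comparison concrete, here is a minimal sketch of how TASD output is commonly scored: exact-match micro-F1 over (target, category, polarity) triples. The function name `tuple_micro_f1`, the example sentences, and the category names are illustrative assumptions; the summary above does not state which scoring script the paper uses.

```python
def tuple_micro_f1(gold, pred):
    """Micro-averaged F1 over per-sentence sets of annotation tuples."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # tuples predicted exactly as annotated
        fp += len(p - g)   # predicted tuples absent from the reference
        fn += len(g - p)   # reference tuples the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and prediction for two sentences (assumed labels).
gold = [
    {("Pizza", "FOOD", "positive"), ("Kellner", "SERVICE", "negative")},
    {("NULL", "PRICE", "neutral")},
]
pred = [
    {("Pizza", "FOOD", "positive"), ("Kellner", "SERVICE", "positive")},
    {("NULL", "PRICE", "neutral")},
]
f1 = tuple_micro_f1(gold, pred)
print(round(f1, 3))
```

For ACSA the same function applies if each tuple is reduced to (category, polarity), since that subtask does not require the target span.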
What carries the argument
Inter-annotator agreement metrics combined with controlled evaluation of model performance on ACSA and TASD subtasks after training on each annotation source.
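As an illustration of the agreement side of this argument, the sketch below computes Cohen's kappa between one annotation source and the expert reference. Which IAA coefficient the paper actually reports is not stated here, so the choice of Cohen's kappa, the function name, and the toy labels are all assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sentiment labels for one aspect category across six sentences.
expert = ["pos", "neg", "neg", "neu", "pos", "neg"]
crowd = ["pos", "neg", "pos", "neu", "pos", "neu"]
kappa = cohen_kappa(expert, crowd)
print(round(kappa, 3))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one sentiment class dominates the data.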
If this is right
- Models trained on expert annotations reach higher accuracy on German ACSA and TASD than models trained on the other sources.
- Large language models can supply annotations at lower cost but introduce more variability that reduces final model performance.
- Student and crowdworker annotations occupy an intermediate position in both agreement and downstream results.
- Dataset builders for other under-resourced languages can use these measured trade-offs to choose annotation methods that match available time and budget.
Where Pith is reading between the lines
- Hybrid workflows that let LLMs generate initial labels and then route uncertain cases to experts could reduce total cost while preserving most of the quality gain.
- The same source-comparison design could be applied to other sequence-labeling tasks such as named-entity recognition or opinion-target extraction in low-resource settings.
- Performance differences observed on held-out test data may shrink or grow when the models are tested on entirely new German text from different domains.
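The hybrid workflow suggested in the first point above could be sketched roughly as follows. This assumes the LLM exposes some per-label confidence score, which the paper summary does not confirm; `route_annotations`, the 0.8 threshold, and the example records are hypothetical.

```python
def route_annotations(items, threshold=0.8):
    """Split LLM-labelled items into auto-accepted and expert-review queues."""
    accepted, needs_expert = [], []
    for item in items:
        # Confident labels are kept; uncertain ones go to a human expert.
        if item["confidence"] >= threshold:
            accepted.append(item)
        else:
            needs_expert.append(item)
    return accepted, needs_expert


# Hypothetical LLM output: (category, polarity) labels with assumed confidences.
llm_output = [
    {"id": 1, "label": ("FOOD", "positive"), "confidence": 0.95},
    {"id": 2, "label": ("SERVICE", "negative"), "confidence": 0.55},
    {"id": 3, "label": ("PRICE", "neutral"), "confidence": 0.81},
]
accepted, needs_expert = route_annotations(llm_output)
print([i["id"] for i in accepted], [i["id"] for i in needs_expert])
```

The threshold sets the cost-quality trade-off: raising it sends more items to experts, and a suitable value would have to be calibrated against expert labels on a held-out sample.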
Load-bearing premise
That the expert re-annotations form an unbiased and stable reference that fairly represents the correct labels for the texts.
What would settle it
A second independent group of experts re-annotating the same texts and producing labels that agree more with student or LLM annotations than with the first expert set would undermine the chosen ground truth.
Original abstract
Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality annotated datasets. This paper examines how different annotation sources influence the development of German ABSA. To this end, an existing dataset is re-annotated by experts to establish a ground truth, which serves as a reference for evaluating annotations produced by students, crowdworkers, Large Language Models (LLMs), and experts. Annotation quality is compared using Inter-Annotator Agreement (IAA) and its impact on downstream model performance for different ABSA subtasks. The evaluation focuses on Aspect Category Sentiment Analysis (ACSA) and Target Aspect Sentiment Detection (TASD). We apply State-of-the-Art (SOTA) methods for ABSA, including BERT-, T5-, and LLaMA-based approaches to assess performance differences, spanning fine-tuning and in-context learning with instruction prompts. The findings provide practical insights into trade-offs between annotation reliability and efficiency, offering guidance for dataset construction in under-resourced Natural Language Processing (NLP) scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that re-annotating an existing German ABSA dataset with experts establishes a reliable ground truth against which student, crowdworker, and LLM annotations can be compared via IAA and downstream performance on ACSA and TASD subtasks; SOTA models (BERT, T5, LLaMA) in fine-tuning and in-context learning settings then reveal practical reliability-efficiency trade-offs for dataset construction in under-resourced NLP.
Significance. If the expert ground truth is validated, the work supplies concrete guidance for low-resource ABSA dataset creation by quantifying when cheaper sources (LLMs, crowdworkers) can substitute for experts without substantial downstream loss. The dual IAA-plus-task-performance evaluation is a strength that could inform annotation protocols beyond German ABSA.
major comments (1)
- [Abstract and §3] Abstract and §3 (Annotation and Evaluation Setup): The central comparisons rest on expert re-annotations as the reference standard, yet no IAA among experts, disagreement-resolution protocol, or comparison to the original dataset labels is reported. Without these metrics the observed gaps for students, crowdworkers, and LLMs cannot be confidently attributed to quality rather than reference bias or task-interpretation differences; this assumption is load-bearing for all trade-off claims.
minor comments (1)
- [Abstract] The abstract mentions 'SOTA methods' and 'instruction prompts' without naming the exact model variants, prompt templates, or hyper-parameters used in the in-context learning experiments; this reduces reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights an important aspect of our methodology. We address the major comment below and will incorporate revisions to strengthen the validation of our expert ground truth.
- Referee: [Abstract and §3] Abstract and §3 (Annotation and Evaluation Setup): The central comparisons rest on expert re-annotations as the reference standard, yet no IAA among experts, disagreement-resolution protocol, or comparison to the original dataset labels is reported. Without these metrics the observed gaps for students, crowdworkers, and LLMs cannot be confidently attributed to quality rather than reference bias or task-interpretation differences; this assumption is load-bearing for all trade-off claims.
  Authors: We agree that these details are necessary to rigorously establish the expert re-annotations as a reliable reference standard and to rule out reference bias. In the revised manuscript, we will expand §3 to report inter-annotator agreement among the expert annotators (using metrics such as Fleiss' kappa for multi-label aspects and sentiments), describe the disagreement-resolution protocol (e.g., discussion rounds leading to consensus), and include a direct comparison of the expert re-annotations against the original dataset labels, quantifying differences in aspect coverage and sentiment assignments. These additions will allow readers to assess whether performance gaps for students, crowdworkers, and LLMs stem from annotation quality rather than inconsistencies in the ground truth, thereby reinforcing the validity of our reliability-efficiency trade-off claims for ACSA and TASD.
  Revision: yes
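For readers unfamiliar with the coefficient named in the response, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. It treats each (sentence, aspect category) decision as one item with a single categorical label, which is one possible way to handle the multi-label setting and is an assumption here; the ratings are invented for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings holds one list of rater labels per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings]
    # Per-item agreement: fraction of rater pairs that chose the same label.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical polarity labels from three experts on four items.
expert_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neu", "neu", "neu"],
    ["neg", "neg", "neg"],
]
kappa_experts = fleiss_kappa(expert_ratings, ["pos", "neg", "neu"])
print(round(kappa_experts, 3))
```

Reporting this value for the expert group alongside each source's agreement with the reference would let readers judge how much disagreement is inherent to the task.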
Circularity Check
No significant circularity: empirical comparison anchored to expert reference without self-referential derivations
The paper presents a purely empirical study that re-annotates an existing German ABSA dataset with experts to create a reference, then measures IAA and downstream ACSA/TASD model performance for student, crowdworker, and LLM annotations against that reference. No equations, fitted parameters, or predictive derivations appear in the described method; the evaluation pipeline does not reduce any output to its inputs by construction. Standard IAA metrics and fine-tuned/in-context model runs constitute independent benchmarks rather than tautological restatements. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the core claims. The design is self-contained against the external expert annotations as reference, which is a conventional methodological choice in annotation-quality research and does not create circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert annotations of the existing dataset provide a reliable ground truth for ABSA.