Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?

Manal Binkhonain; Reem Alfayez

arxiv: 2604.13826 · v1 · submitted 2026-04-15 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-10 12:44 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords sentiment analysis, zero-shot learning, software engineering, natural language processing, text classification, machine learning, annotated data scarcity


The pith

Zero-shot learning techniques paired with expert-curated labels can reach macro-F1 scores comparable to those of fine-tuned models in software engineering sentiment analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether zero-shot learning can perform sentiment analysis on software engineering artifacts without large amounts of labeled data. It evaluates embedding-based, NLI-based, TARS-based, and generative-based zero-shot methods across different label setups and compares them to fine-tuned transformer models. The key finding is that certain zero-shot approaches reach macro-F1 scores similar to those of the supervised methods when paired with expert-curated labels. This matters because annotated datasets for this domain are costly to obtain and require specialized knowledge. The study also analyzes errors, finding that subjective annotations and polar statements of fact often lead to mistakes in zero-shot classifications.

Core claim

The study demonstrates that zero-shot learning techniques, particularly those that combine expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to those of fine-tuned transformer-based models for sentiment analysis in software engineering. This capability addresses the challenge of annotated dataset scarcity by reducing the need for extensive domain-specific labeling efforts.

What carries the argument

Zero-shot learning applied to sentiment classification tasks, where models classify text into sentiment categories using pre-trained knowledge and label descriptions without task-specific fine-tuning data.
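
As a minimal sketch of the embedding-based variant (the one Figure 1 describes: embed the text and each label, compare them by cosine similarity, pick the highest score), the Python below uses a toy bag-of-words vector in place of a pre-trained model. The `embed` function, the `LABELS` descriptions, and the sample sentence are all assumptions invented for illustration; they are not the paper's models, labels, or data.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a pre-trained encoder: a bag-of-words count vector.

    ASSUMPTION: a real setup would call a pre-trained embedding model here.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, label_descriptions):
    """Return the label whose description is most similar to the text, plus all scores."""
    text_vec = embed(text)
    scores = {
        label: cosine(text_vec, embed(description))
        for label, description in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical label descriptions, invented for this example only.
LABELS = {
    "positive": "thanks great works nice love helpful",
    "negative": "broken terrible hate annoying fails crash",
    "neutral": "the function returns a value",
}

predicted, scores = classify("This library is great, thanks for the fix!", LABELS)
print(predicted, scores)
```

No sentiment-labeled training data is involved: the only task-specific input is the wording of the label descriptions. Replacing `embed` with a real pre-trained encoder would leave `classify` unchanged.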

If this is right

Zero-shot learning provides a viable alternative to supervised learning for sentiment analysis in software engineering.
Expert-curated labels significantly boost the performance of embedding-based and generative zero-shot methods.
Different configurations of labels influence the effectiveness of zero-shot techniques.
Subjectivity in annotations and polar factual statements are primary sources of classification errors.
Adopting zero-shot methods can lower the barrier to developing sentiment analysis tools tailored to software engineering contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Zero-shot learning could be applied to other label-scarce tasks in software engineering such as defect prediction or requirement classification.
Integrating zero-shot methods with active learning might further reduce the amount of expert input needed.
Results suggest that improving label quality could be more impactful than refining the zero-shot models themselves.
Broader adoption might enable real-time sentiment monitoring in large code repositories without prior training.

Load-bearing premise

The tested datasets and zero-shot implementations are representative of typical software engineering sentiment analysis scenarios.

What would settle it

Observing substantially lower macro-F1 scores for the best zero-shot methods compared to fine-tuned models on a new, independently collected software engineering dataset would indicate the comparability does not hold generally.
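
The comparison described here comes down to computing macro-F1 for each method against the same gold labels and looking at the gap. The sketch below shows that calculation in plain Python; the label lists are hypothetical toy data, not results from the paper, and `macro_f1` is an illustrative helper written for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels and predictions, for illustration only.
gold = ["pos", "pos", "neg", "neg", "neu", "neu"]
zsl = ["pos", "neu", "neg", "neg", "neu", "neu"]
fine_tuned = ["pos", "pos", "neg", "neu", "neu", "neu"]

gap = macro_f1(gold, fine_tuned) - macro_f1(gold, zsl)
print(f"fine-tuned: {macro_f1(gold, fine_tuned):.3f}")
print(f"zero-shot:  {macro_f1(gold, zsl):.3f}")
print(f"gap:        {gap:.3f}")
```

Because macro-F1 weights every class equally, a method that does poorly on a rare class is penalized as much as one that does poorly on a frequent class.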

Figures

Figures reproduced from arXiv: 2604.13826 by Manal Binkhonain, Reem Alfayez.

**Figure 1.** An illustration of the process of embedding-based ZSL text classification. Both the input text and potential class labels are passed through a pre-trained LLM to generate embeddings. Classification is performed by calculating the cosine similarity between the input text embedding and each class label embedding. The class with the highest similarity score is then selected as the predicted label (i.e., label 1 in…

**Figure 2.** An illustration of NLI-based ZSL.

**Figure 3.** An illustration of TARS-based ZSL.

**Figure 4.** An illustration of generative-based ZSL.

**Figure 5.** Scott-Knott ESD ranking for ZSL models based on macro-F1 score.

**Figure 6.** Scott-Knott ESD ranking for embedding-based model-label combinations.

**Figure 7.** Scott-Knott ESD ranking for NLI-based model-label combinations.

**Figure 8.** Scott-Knott ESD ranking for the TARS model-label combinations.

**Figure 9.** Scott-Knott ESD ranking for the generative model-label combinations.

**Figure 10.** Scott-Knott ESD ranking for model-label combinations based on…

**Figure 11.** Scott-Knott ESD ranking for the state-of-the-art fine-tuned…

Original abstract

Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort. Objective: This study explores the potential of ZSL to address the scarcity of annotated datasets in sentiment analysis within software engineering. Method:} We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, NLI-based, TARS-based, and generative-based ZSL techniques. We assessed the performance of these techniques under different label setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications. Results: Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications. Conclusion: This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ZSL with expert labels gets macro-F1 close to fine-tuned models on the tested SE datasets, but the comparability rests on those specific setups and label choices.


The paper's core result is that embedding-based and generative ZSL methods, when paired with expert-curated labels, reach macro-F1 scores that sit near those of fine-tuned transformers on software engineering sentiment tasks. This directly tackles the annotation bottleneck that has limited domain-specific tools in the field. The work is new in running a head-to-head across multiple ZSL families (embedding, NLI, TARS, generative) against standard fine-tuned baselines, while also testing different label configurations and adding an error analysis. That combination is not in the prior SE sentiment literature.

The error breakdown is useful: it points to annotation subjectivity and polar facts as the main misclassification drivers, which gives readers a concrete sense of where these methods still fail. The experimental design looks honest in its scope and avoids obvious circularity. The comparison is empirical and uses existing techniques without self-referential fitting. The central claim holds up on the reported setups, though the numbers themselves are not in the abstract.

The soft spot is generalizability. The results depend on the chosen datasets and the expert labels; if those labels are tuned to the data or if the datasets miss typical SE variability, the comparability may not travel. Annotation subjectivity affects the ground truth for both ZSL and the baselines, so it does not automatically make ZSL look better. Without cross-dataset validation or equivalence testing, the claim that ZSL solves the scarcity problem remains tied to the evaluated conditions.

This paper is for SE researchers and practitioners who need sentiment tools but cannot afford large annotated sets. It offers a practical starting point rather than a finished solution. A serious editor should send it to peer review. The question is relevant, the empirical work is grounded, and referees can tighten the controls and test broader applicability.

Referee Report

3 major / 1 minor

Summary. The paper claims that zero-shot learning (ZSL) techniques, particularly embedding-based and generative-based models paired with expert-curated labels, can achieve macro-F1 scores comparable to fine-tuned transformer-based models for sentiment analysis in software engineering. It evaluates embedding-based, NLI-based, TARS-based, and generative-based ZSL approaches under varying label setups, compares them empirically to state-of-the-art supervised models, and uses error analysis to attribute misclassifications primarily to annotation subjectivity and polar facts, concluding that ZSL mitigates the need for annotated datasets.

Significance. If the comparability result holds under broader validation, the work would be significant for software engineering by lowering the barrier to sentiment analysis tools, which currently depend on costly domain-specific annotations. It offers empirical guidance on ZSL viability in SE contexts and highlights practical error sources that could inform hybrid approaches, potentially accelerating adoption where labeled data is scarce.

major comments (3)

[Abstract] The central claim that ZSL techniques 'can achieve macro-F1 scores comparable to fine-tuned transformer-based models' is presented without any quantitative scores, dataset sizes, label counts, or statistical tests, making it impossible to assess whether the comparability is meaningful or merely directional.
[Error Analysis] Subjectivity in annotation and polar facts are identified as primary misclassification drivers, yet these issues plausibly affect the ground-truth labels shared by both ZSL methods and the fine-tuned baselines; without inter-annotator agreement statistics or a differential error breakdown, the comparability result risks being an artifact of the chosen annotation process rather than a property of ZSL.
[Conclusion] The claim that ZSL reduces reliance on annotated datasets rests on the untested assumption that the evaluated datasets and expert-curated label configurations are representative of typical SE sentiment variability; the absence of cross-dataset validation or equivalence testing leaves generalizability unsupported.
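
For readers unfamiliar with the inter-annotator agreement statistics requested in the second comment, Cohen's kappa is one common choice for two raters: it corrects the raw agreement rate for the agreement expected by chance. The sketch below is a plain-Python illustration on hypothetical ratings; nothing here comes from the paper's datasets, and the handling of the degenerate single-label case is a convention chosen for this example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout (convention: perfect agreement).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment annotations from two raters, for illustration only.
annotator_1 = ["pos", "pos", "neg", "neg", "neu", "neu", "neu", "neu"]
annotator_2 = ["pos", "neu", "neg", "neg", "neu", "neu", "pos", "neu"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 6 of 8 items (0.75), chance agreement is 0.375, and kappa is (0.75 - 0.375) / (1 - 0.375) = 0.6.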

minor comments (1)

[Abstract] Abstract contains a clear formatting artifact ('Method:} We conducted') with an extraneous closing brace that should be removed for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity, rigor, and scope that we will address in the revision. We respond to each major comment below.


Referee: [Abstract] The central claim that ZSL techniques 'can achieve macro-F1 scores comparable to fine-tuned transformer-based models' is presented without any quantitative scores, dataset sizes, label counts, or statistical tests, making it impossible to assess whether the comparability is meaningful or merely directional.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the strength of the comparability claim. In the revised version, we will incorporate key quantitative results from the experiments, including the macro-F1 scores for the best-performing ZSL configurations and the fine-tuned baselines, along with dataset sizes, label counts, and a brief note on the statistical comparisons performed. This change will make the central claim more concrete without altering the manuscript's findings.

revision: yes

Referee: [Error Analysis] Subjectivity in annotation and polar facts are identified as primary misclassification drivers, yet these issues plausibly affect the ground-truth labels shared by both ZSL methods and the fine-tuned baselines; without inter-annotator agreement statistics or a differential error breakdown, the comparability result risks being an artifact of the chosen annotation process rather than a property of ZSL.

Authors: This observation is fair and points to a limitation in our error analysis. Because both ZSL and fine-tuned models are assessed against identical ground-truth labels, the performance comparison remains valid as a measure of how each method performs on the same (potentially noisy) annotations typical of SE sentiment data. We did not compute or report inter-annotator agreement because the datasets originate from prior published studies in which such statistics were not provided. We will add a dedicated limitations paragraph acknowledging this and expand the error analysis section to include a side-by-side comparison of error categories across ZSL and supervised models. This will clarify that the identified error sources are task-inherent rather than method-specific.

revision: partial

Referee: [Conclusion] The claim that ZSL reduces reliance on annotated datasets rests on the untested assumption that the evaluated datasets and expert-curated label configurations are representative of typical SE sentiment variability; the absence of cross-dataset validation or equivalence testing leaves generalizability unsupported.

Authors: We accept that the conclusion overstates generalizability. Our evaluation was conducted on multiple established SE sentiment datasets using expert-curated labels, yet we did not perform explicit cross-dataset validation or statistical equivalence testing. In the revised conclusion, we will explicitly qualify the claims to reflect the scope of the datasets and label setups examined in this study, while recommending broader validation as future work. This revision ensures the conclusion accurately represents the empirical evidence presented.

revision: yes
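
The equivalence testing raised in this exchange could take several forms. One simple option, sketched here under stated assumptions, is a paired bootstrap: resample the test items with replacement, recompute the macro-F1 gap between two methods on each resample, and check whether the resulting interval sits inside a pre-chosen margin. The data, the 0.05 margin, and the helper names are all hypothetical; this is not the procedure the paper or the rebuttal describes.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over a fixed label set."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_gap_ci(gold, preds_a, preds_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro_f1(a) - macro_f1(b) on paired items."""
    rng = random.Random(seed)
    labels = sorted(set(gold) | set(preds_a) | set(preds_b))
    n = len(gold)
    gaps = []
    for _ in range(n_resamples):
        # Resample item indices so both methods are scored on the same items.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        gaps.append(macro_f1(g, a, labels) - macro_f1(g, b, labels))
    gaps.sort()
    lower = gaps[int(alpha / 2 * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical gold labels and predictions, for illustration only.
GOLD = ["pos", "neg", "neu", "neu", "pos", "neg", "neu", "pos", "neg", "neu", "neu", "pos"]
FINE_TUNED = ["pos", "neg", "neu", "neu", "pos", "neu", "neu", "pos", "neg", "neu", "pos", "pos"]
ZSL = ["pos", "neg", "neu", "pos", "pos", "neg", "neu", "neu", "neg", "neu", "neu", "pos"]

MARGIN = 0.05  # hypothetical equivalence margin, chosen only for this example
lower, upper = bootstrap_gap_ci(GOLD, FINE_TUNED, ZSL)
equivalent = -MARGIN < lower and upper < MARGIN
print(f"gap interval: [{lower:.3f}, {upper:.3f}]  within margin: {equivalent}")
```

With a test set this small the interval will usually be far wider than any reasonable margin, which illustrates why dataset sizes matter when judging whether two macro-F1 scores are comparable.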

Circularity Check

0 steps flagged

No circularity: standard empirical head-to-head evaluation of ZSL techniques


The paper reports an empirical study that runs multiple ZSL variants (embedding-based, NLI-based, TARS-based, generative) on SE sentiment datasets under varying label setups, measures macro-F1, and directly compares the numbers to fine-tuned transformer baselines. No equations, fitted parameters, or predictions are defined in terms of the target result; the comparability claim is the observed experimental outcome, not a quantity forced by construction or by a self-citation chain. Error analysis is post-hoc inspection of misclassifications and does not retroactively define the performance metric. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from machine learning evaluation rather than new postulates; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Pre-trained models for ZSL can transfer to the software engineering sentiment domain without domain-specific fine-tuning. Core premise enabling the comparison to fine-tuned models.
domain assumption: Expert-curated labels provide a fair and unbiased basis for evaluating ZSL performance. Invoked when claiming comparability under different label setups.

pith-pipeline@v0.9.0 · 5576 in / 1260 out tokens · 40845 ms · 2026-05-10T12:44:02.368560+00:00 · methodology



[1] [1]

Zhang, B

T. Zhang, B. Xu, F. Thung, S. A. Haryono, D. Lo, L. Jiang, Sentiment analysis for software engineering: How far can pre-trained transformer models go?, in: 2020 IEEE Intealefato2018sentimentrnational Confer- ence on Software Maintenance and Evolution (ICSME), IEEE, 2020, pp. 70–80

work page 2020

[2] [2]

Zhang, I

T. Zhang, I. C. Irsan, F. Thung, D. Lo, Revisiting sentiment analy- sis for software engineering in the era of large language models, ACM Transactions on Software Engineering and Methodology 34 (3) (2025) 1–30

work page 2025

[3] [3]

Obaidi, J

M. Obaidi, J. Kl¨ under, Development and application of sentiment anal- ysis tools in software engineering: A systematic literature review, in: Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, 2021, pp. 80–89

work page 2021

[4] [4]

B. Lin, F. Zampetti, G. Bavota, M. Di Penta, M. Lanza, R. Oliveto, Sentiment analysis for software engineering: How far can we go?, in: Proceedings of the 40th international conference on software engineering, 2018, pp. 94–104

work page 2018

[5] [5]

Sajadi, K

A. Sajadi, K. Damevski, P. Chatterjee, Towards understanding emotions in informal developer interactions: A gitter chat study, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 2097– 2101

work page 2023

[6] [6]

Novielli, F

N. Novielli, F. Calefato, D. Dongiovanni, D. Girardi, F. Lanubile, Can we use se-specific sentiment analysis tools in a cross-platform setting?, in: Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 158–168. 33

work page 2020

[7] [7]

Uddin, F

G. Uddin, F. Khomh, Automatic mining of opinions expressed about apis in stack overflow, IEEE Transactions on Software Engineering 47 (3) (2019) 522–559

work page 2019

[8] [8]

Calefato, F

F. Calefato, F. Lanubile, F. Maiorano, N. Novielli, Sentiment polarity detection for software development, in: Proceedings of the 40th Inter- national Conference on Software Engineering, 2018, pp. 128–128

work page 2018

[9] [9]

Jongeling, S

R. Jongeling, S. Datta, A. Serebrenik, Choosing your weapons: On sen- timent analysis tools for software engineering research, in: 2015 IEEE International Conference on Software Maintenance and Evolution (IC- SME), IEEE, 2015, pp. 531–535

work page 2015

[10] [10]

Tourani, Y

P. Tourani, Y. Jiang, B. Adams, Monitoring sentiment in open source mailing lists: exploratory study on the apache ecosystem., in: CASCON, Vol. 14, 2014, pp. 34–44

work page 2014

[11] [11]

Novielli, D

N. Novielli, D. Girardi, F. Lanubile, A benchmark study on sentiment analysis for software engineering research, in: Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 364–375

work page 2018

[12] [12]

B. Lin, N. Cassee, A. Serebrenik, G. Bavota, N. Novielli, M. Lanza, Opinion mining for software development: a systematic literature re- view, ACM Transactions on Software Engineering and Methodology (TOSEM) 31 (3) (2022) 1–41

work page 2022

[13] L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, O’Reilly Media, Inc., 2022.

[14] J. Alammar, M. Grootendorst, Hands-On Large Language Models: Language Understanding and Generation, O’Reilly Media, Inc., 2024.

[15] S. P. Veeranna, J. Nam, E. L. Mencía, J. Fürnkranz, Using semantic similarity for multi-label zero-shot classification of text documents, in: Proceeding of european symposium on artificial neural networks, computational intelligence and machine learning. bruges, belgium: Elsevier, 2016, pp. 423–428.

[16] W. Alhoshan, A. Ferrari, L. Zhao, Zero-shot learning for requirements classification: An exploratory study, Information and Software Technology 159 (2023) 107202.

[17] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).

[18] K. Halder, A. Akbik, J. Krapac, R. Vollgraf, Task-aware representation of sentences for generic text classification, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3202–3213.

[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.

[20] M. Sánchez-Gordón, R. Colomo-Palacios, Taking the emotional pulse of software engineering—a systematic literature review of empirical studies, Information and Software Technology 115 (2019) 23–43.

[21] M. Obaidi, L. Nagel, A. Specht, J. Klünder, Sentiment analysis tools in software engineering: A systematic mapping study, Information and Software Technology 151 (2022) 107018.

[22] M. R. Islam, M. F. Zibran, Sentistrength-se: Exploiting domain specificity for improved sentiment analysis in software engineering text, Journal of Systems and Software 145 (2018) 125–146.

[23] M. R. Islam, M. F. Zibran, Deva: sensing emotions in the valence arousal space in software engineering text, in: Proceedings of the 33rd annual ACM symposium on applied computing, 2018, pp. 1536–1543.

[24] M. R. Islam, M. K. Ahmmed, M. F. Zibran, Marvalous: Machine learning based detection of emotions in the valence-arousal space in software engineering text, in: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 2019, pp. 1786–1793.

[25] A. Murgia, M. Ortu, P. Tourani, B. Adams, S. Demeyer, An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems, Empirical Software Engineering 23 (2018) 521–564.

[26] S. Cagnoni, L. Cozzini, G. Lombardo, M. Mordonini, A. Poggi, M. Tomaiuolo, Emotion-based analysis of programming languages on stack overflow, ICT Express 6 (3) (2020) 238–242.

[27] G. Uddin, Y.-G. Guéhéneuc, F. Khomh, C. K. Roy, An empirical study of the effectiveness of an ensemble of stand-alone sentiment detection tools for software engineering datasets, ACM Transactions on Software Engineering and Methodology (TOSEM) 31 (3) (2022) 1–38.

[28] T. Ahmed, A. Bosu, A. Iqbal, S. Rahimi, Senticr: A customized sentiment analysis tool for code review interactions, in: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2017, pp. 106–111.

[29] J. Ding, H. Sun, X. Wang, X. Liu, Entity-level sentiment analysis of issue comments, in: Proceedings of the 3rd International Workshop on Emotion Awareness in Software Engineering, 2018, pp. 7–13.

[30] E. Biswas, M. E. Karabulut, L. Pollock, K. Vijay-Shanker, Achieving reliable sentiment analysis in the software engineering domain using bert, in: 2020 IEEE International conference on software maintenance and evolution (ICSME), IEEE, 2020, pp. 162–173.

[31] H. Batra, N. S. Punn, S. K. Sonbhadra, S. Agarwal, Bert-based sentiment analysis: A software engineering perspective, in: Database and Expert Systems Applications: 32nd International Conference, DEXA 2021, Virtual Event, September 27–30, 2021, Proceedings, Part I 32, Springer, 2021, pp. 138–148.

[32] D. Bleyl, E. K. Buxton, Emotion recognition on stackoverflow posts using bert, in: 2022 IEEE International Conference on Big Data (Big Data), IEEE, 2022, pp. 5881–5885.

[33] K. Sun, X. Shi, H. Gao, H. Kuang, X. Ma, G. Rong, D. Shao, Z. Zhao, H. Zhang, Incorporating pre-trained transformer models into textcnn for sentiment analysis on software engineering texts, in: Proceedings of the 13th Asia-Pacific Symposium on Internetware, 2022, pp. 127–136.

[34] M. Shafikuzzaman, M. R. Islam, A. C. Rolli, S. Akhter, N. Seliya, An empirical evaluation of the zero-shot, few-shot, and traditional fine-tuning based pretrained language models for sentiment analysis in software engineering, IEEE Access (2024).

[35] V. R. B.-G. Caldiera, H. D. Rombach, Goal question metric paradigm, Encyclopedia of software engineering 1 (528-532) (1994) 6.

[36] M. M. Imran, Y. Jain, P. Chatterjee, K. Damevski, Data augmentation for improving emotion recognition in software engineering communication, in: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–13.

[37] C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge university press, 2008.

[38] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal, M. Data, Practical machine learning tools and techniques, in: Data mining, Vol. 2, Elsevier Amsterdam, The Netherlands, 2005, pp. 403–413.

[39] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, K. Matsumoto, The impact of automated parameter optimization on defect prediction models, IEEE Transactions on Software Engineering 45 (7) (2018) 683–711.

[40] M.-T. Puth, M. Neuhäuser, G. D. Ruxton, Effective use of spearman’s and kendall’s correlation coefficients for association between two measured traits, Animal Behaviour 102 (2015) 77–84.

[41] M. L. McHugh, Interrater reliability: the kappa statistic, Biochemia medica 22 (3) (2012) 276–282.

[42] F. Shull, J. Singer, D. I. Sjøberg, Guide to advanced empirical software engineering, Springer, 2007.

[43] P. Ralph, N. b. Ali, S. Baltes, D. Bianculli, J. Diaz, Y. Dittrich, N. Ernst, M. Felderer, R. Feldt, A. Filieri, et al., Empirical standards for software engineering research, arXiv preprint arXiv:2010.03525 (2020).