Do We Still Need Fine Tuning? Turkish Sentiment Analysis in the Era of Large Language Model

Sercan Karakaş; Yusuf Şimşek

arXiv: 2606.29614 · v1 · pith:XFXMLCHVnew · submitted 2026-06-28 · cs.CL · cs.AI

Pith reviewed 2026-06-30 07:03 UTC · model grok-4.3

keywords Turkish sentiment analysis · fine-tuning · large language models · BERTurk · neutral class · zero-shot prompting · e-commerce reviews · three-class classification

The pith

Fine-tuned BERTurk models outperform prompted large language models on Turkish three-class sentiment analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether supervised fine-tuning remains necessary for Turkish sentiment analysis by comparing classical machine learning methods, fine-tuned pretrained language models, and prompted large language models on an e-commerce review dataset labeled negative, neutral, or positive. Fine-tuned BERTurk models achieve the highest performance in the full three-class task and beat all prompted large language models. The neutral class proves the main difficulty, as prompted models often collapse neutral reviews into positive or negative categories and perform better when restricted to binary classification. The results indicate that prompted large language models do not yet match supervised fine-tuning for realistic zero-shot Turkish sentiment classification that includes neutrals.

Core claim

Fine-tuned BERTurk models perform best overall and outperform all prompted large language models in the full three-class task on a Turkish e-commerce review dataset. The neutral class emerges as the main difficulty: while several large language models are much more competitive in binary positive-negative classification, they degrade substantially in the three-class setting by collapsing neutral reviews into polarized categories. The findings suggest that, in realistic Turkish sentiment classification, prompted large language models do not yet match supervised fine-tuning in the zero-shot setting, and that including the neutral class is crucial for robust evaluation.
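A minimal sketch of why including the neutral class changes the picture. It assumes macro-averaged F1 as the metric (the abstract does not name the metric used) and uses toy labels that are purely hypothetical, not results from the paper. A classifier that never predicts "neutral" looks perfect on the binary subset and drops sharply on the full three-class task.

```python
from typing import Sequence

LABELS = ("negative", "neutral", "positive")

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: the model folds every neutral review into a polar class.
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "negative", "positive", "positive"]

three_class = macro_f1(y_true, y_pred, LABELS)

# Binary setting: keep only reviews whose gold label is negative or positive.
pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != "neutral"]
binary = macro_f1([t for t, _ in pairs], [p for _, p in pairs],
                  ("negative", "positive"))

print(f"three-class macro-F1: {three_class:.3f}")
print(f"binary macro-F1:      {binary:.3f}")
```

On this toy input the binary score is perfect while the three-class score is pulled down by a zero F1 on the neutral class, which is the pattern the core claim describes.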

What carries the argument

Direct performance comparison of fine-tuned BERTurk against zero-shot prompted large language models on three-class Turkish sentiment classification of e-commerce reviews.

Load-bearing premise

The prompting strategies and model choices for the large language models represent a fair and near-optimal zero-shot baseline rather than an under-optimized one.

What would settle it

An experiment in which optimized prompts or few-shot examples allow prompted large language models to match or exceed fine-tuned BERTurk accuracy on the identical three-class Turkish dataset.

Figures

Figures reproduced from arXiv: 2606.29614 by Sercan Karakaş, Yusuf Şimşek.

Original abstract

This study examines whether supervised fine-tuning remains necessary for Turkish sentiment analysis in the era of large language models. We compare classical machine learning methods, fine-tuned pretrained language models, and prompted large language models on a Turkish e-commerce review dataset with negative, neutral, and positive labels. Fine-tuned BERTurk models perform best overall and outperform all prompted large language models in the full three-class task. The neutral class emerges as the main difficulty: while several large language models are much more competitive in binary positive-negative classification, they degrade substantially in the three-class setting by collapsing neutral reviews into polarized categories. The findings suggest that, in realistic Turkish sentiment classification, prompted large language models do not yet match supervised fine-tuning in the zero-shot setting, and that including the neutral class is crucial for robust evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Fine-tuned BERTurk beats the zero-shot LLMs on this Turkish three-class task mainly because the LLMs mishandle neutrals, but the prompting details are thin enough that the gap could be narrower with better setup.

The paper's core finding is that fine-tuned BERTurk models come out ahead of prompted LLMs on a Turkish e-commerce review dataset when the task includes a neutral class. The LLMs hold up better on binary positive-negative splits but drop when neutrals are added, often folding them into the extremes.

What stands out is the direct comparison on a non-English dataset with real three-class labels. That neutral-class collapse is a concrete observation worth noting for anyone doing sentiment work in similar languages. The experimental outline in the abstract is coherent and the results are measured on held-out data, so there is no obvious circularity.

The main limitation is the lack of prompt details, model selection criteria, or any mention of Turkish-specific phrasing or few-shot variants. Without those, it is hard to tell whether the prompted models were given a fair shot or just generic instructions. If the prompts were under-optimized, the headline claim that fine-tuning is still required rests on a weaker baseline than it appears.

This is useful for Turkish NLP practitioners who need to decide between fine-tuning and prompting on modest data. It is not a broad theoretical advance, but the empirical pattern is worth having on record. I would send it to peer review so the prompting setup and any additional controls can be checked.

Referee Report

2 major / 2 minor

Summary. The paper compares classical ML baselines, fine-tuned BERTurk models, and zero-shot prompted LLMs on a Turkish e-commerce review dataset for three-class sentiment analysis (negative, neutral, positive). It reports that fine-tuned BERTurk achieves the highest performance overall, while prompted LLMs are competitive in binary positive-negative classification but degrade markedly in the three-class task by misclassifying neutral reviews into polarized categories. The authors conclude that supervised fine-tuning remains necessary for robust Turkish sentiment classification in the zero-shot LLM setting and that neutral-class evaluation is essential.

Significance. If the empirical results hold under improved prompting, the work supplies a focused, language-specific benchmark showing that zero-shot LLM prompting has not yet closed the gap with fine-tuning for Turkish sentiment analysis when neutral labels are included. It provides concrete evidence on the neutral class as a persistent failure mode and underscores the value of realistic multi-class evaluation protocols for non-English tasks. The study is a straightforward empirical contribution that can inform practitioners working on Turkish NLP applications.

major comments (2)

[Methods] Methods / Experimental Setup: The manuscript provides no prompt templates, temperature values, model selection criteria, or tests of Turkish-specific phrasing for the zero-shot LLM baselines. Because the central claim—that fine-tuned BERTurk outperforms all prompted LLMs—rests on these baselines representing a fair zero-shot comparison, the absence of these details leaves open the possibility that the observed gap is an artifact of under-optimization rather than an inherent limitation.
[Results] Results section (performance tables): The reported degradation of LLMs when moving from binary to three-class evaluation is load-bearing for the conclusion, yet no error analysis, confusion matrices, or example neutral misclassifications are supplied to characterize the failure mode. This weakens the ability to interpret whether the neutral class is intrinsically difficult or simply mishandled by the chosen prompts.
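The error analysis requested in the second major comment could be sketched as follows. This is an illustration only: the label order, the function names, and the toy predictions are assumptions, and none of the numbers come from the paper. It builds a three-class confusion matrix and reports the share of gold-neutral reviews that end up in a polarized class.

```python
from collections import Counter
from typing import List, Sequence

LABELS = ("negative", "neutral", "positive")

def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str] = LABELS) -> List[List[int]]:
    """Rows are gold labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def neutral_collapse_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of gold-neutral items predicted as negative or positive."""
    preds = [p for t, p in zip(y_true, y_pred) if t == "neutral"]
    if not preds:
        return 0.0
    collapsed = sum(p in ("negative", "positive") for p in preds)
    return collapsed / len(preds)

# Hypothetical toy predictions, not data from the paper.
y_true = ["negative", "neutral", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "positive", "negative", "neutral", "positive", "positive"]

matrix = confusion_matrix(y_true, y_pred)
rate = neutral_collapse_rate(y_true, y_pred)

for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print(f"neutral collapse rate: {rate:.2f}")
```

Comparing this rate across prompt variants would help separate an intrinsically hard neutral class from one that is mishandled by a particular prompt.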

minor comments (2)

[Abstract] The abstract and conclusion could more explicitly qualify the scope as 'zero-shot prompting' to avoid overgeneralization to few-shot or instruction-tuned settings.
[Tables] Table captions and axis labels should consistently use the same model abbreviations as the main text for readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have incorporated revisions to improve the manuscript's clarity and reproducibility.

Referee: [Methods] Methods / Experimental Setup: The manuscript provides no prompt templates, temperature values, model selection criteria, or tests of Turkish-specific phrasing for the zero-shot LLM baselines. Because the central claim—that fine-tuned BERTurk outperforms all prompted LLMs—rests on these baselines representing a fair zero-shot comparison, the absence of these details leaves open the possibility that the observed gap is an artifact of under-optimization rather than an inherent limitation.

Authors: We agree that the absence of these details limits reproducibility and interpretability. In the revised manuscript we now include the full prompt templates (both English and Turkish variants tested), temperature values (0.0 for all models to ensure determinism), model selection rationale (covering the most widely used open and proprietary LLMs available during the study period), and a brief discussion of Turkish-specific phrasing attempts. These additions demonstrate that the prompts were constructed following standard zero-shot practices and that the performance gap persists across reasonable prompt variations. revision: yes
Referee: [Results] Results section (performance tables): The reported degradation of LLMs when moving from binary to three-class evaluation is load-bearing for the conclusion, yet no error analysis, confusion matrices, or example neutral misclassifications are supplied to characterize the failure mode. This weakens the ability to interpret whether the neutral class is intrinsically difficult or simply mishandled by the chosen prompts.

Authors: We acknowledge that the lack of error analysis weakens the interpretation of the neutral-class collapse. The revised version now contains confusion matrices for the strongest fine-tuned and prompted models in both binary and three-class settings, plus three representative examples of neutral reviews that LLMs consistently misclassify as positive or negative. The added analysis shows that the degradation is driven by systematic neutral-to-polarized shifts rather than isolated prompt failures, thereby strengthening the central claim. revision: yes
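The first response above mentions English and Turkish prompt templates and a temperature of 0.0. The sketch below shows one way such a zero-shot setup could be organized. The template wording, the label aliases, and the names `build_prompt` and `parse_label` are all hypothetical; they are not the authors' prompts, and no model API is called here.

```python
import re
from typing import Optional

# Hypothetical templates; the paper's actual prompts are not given in this page.
PROMPT_TEMPLATES = {
    "en": ("Classify the sentiment of the following Turkish product review as "
           "negative, neutral, or positive. Answer with one word.\n\n"
           "Review: {review}\nSentiment:"),
    "tr": ("Aşağıdaki ürün yorumunun duygusunu olumsuz, nötr veya olumlu olarak "
           "sınıflandır. Tek kelimeyle cevap ver.\n\n"
           "Yorum: {review}\nDuygu:"),
}

# Temperature 0.0 as stated in the simulated rebuttal; the key name is an assumption.
GENERATION_CONFIG = {"temperature": 0.0}

# Assumed surface forms a model might return for each class.
LABEL_ALIASES = {
    "negative": {"negative", "olumsuz", "negatif"},
    "neutral": {"neutral", "nötr"},
    "positive": {"positive", "olumlu", "pozitif"},
}

def build_prompt(review: str, lang: str = "en") -> str:
    """Fill the chosen template with a single review."""
    return PROMPT_TEMPLATES[lang].format(review=review.strip())

def parse_label(output: str) -> Optional[str]:
    """Map raw model text to one label; return None if zero or several labels match."""
    tokens = set(re.findall(r"\w+", output.lower()))
    matches = [label for label, aliases in LABEL_ALIASES.items() if tokens & aliases]
    return matches[0] if len(matches) == 1 else None

prompt = build_prompt("  Kargo geç geldi ama ürün idare eder.  ", lang="tr")
print(prompt)
print(parse_label("Nötr."))
```

Returning `None` for ambiguous outputs keeps unparseable answers visible in the error analysis instead of silently assigning them to a class.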

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking on held-out data

The paper conducts an empirical comparison of classical ML, fine-tuned PLMs, and prompted LLMs on a Turkish e-commerce review dataset with three-class labels. Results are obtained by direct measurement against held-out test data with no derivations, first-principles predictions, parameter fitting that is then renamed as prediction, or load-bearing self-citations. The central claim rests on observed performance numbers rather than any self-referential construction. This matches the default case of a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical comparison paper; no new theoretical entities or derivations. Relies on standard supervised learning assumptions.

axioms (1)

Domain assumption: dataset labels are reliable and the train/test split is representative of real Turkish e-commerce reviews.
Implicit in any benchmark comparison; if the labels or the distribution are noisy, the performance ordering could change.
