pith. machine review for the scientific record.

arxiv: 2605.03439 · v1 · submitted 2026-05-05 · 💻 cs.CL

Recognition: unknown

Benchmarking Logistic Regression, SVM, Naive Bayes, and IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian Product Reviews

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment analysis · Indonesian language · IndoBERT · support vector machine · TF-IDF · class imbalance · model benchmarking · product reviews

The pith

Linear support vector machine outperformed fine-tuned IndoBERT on Indonesian product review sentiment because it used the full dataset while the transformer used only a sample.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks traditional machine learning models against a transformer for classifying Indonesian e-commerce product reviews into positive, neutral, or negative categories. Using TF-IDF features, the Linear SVM reached 97.60 percent accuracy and 0.5510 macro F1-score, while IndoBERT fine-tuning reached 88.70 percent accuracy and 0.5088 macro F1-score. The authors explain the difference by the fact that the baseline models trained on the complete collection of reviews, whereas computational limits forced the transformer to train on a smaller random subset. They handle the severe class imbalance in the data with weighted losses for both approaches and deploy the resulting model as a simple web application.
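As a concrete anchor for the pipeline described above, the baseline can be written in a few lines of scikit-learn. This is an editorial sketch, not the authors' code: the toy reviews, the label encoding (0 = negative, 1 = neutral, 2 = positive), and the default vectorizer settings are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy Indonesian reviews standing in for the Tokopedia corpus.
train_texts = [
    "barang bagus, pengiriman cepat",     # positive
    "kualitas sesuai harga, puas",        # positive
    "biasa saja, tidak istimewa",         # neutral
    "lumayan, sesuai deskripsi",          # neutral
    "barang rusak saat tiba",             # negative
    "sangat mengecewakan, tidak sesuai",  # negative
]
train_labels = [2, 2, 1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(),                   # sparse TF-IDF features
    LinearSVC(class_weight="balanced"),  # reweight classes by inverse frequency
)
model.fit(train_texts, train_labels)

print(model.predict(["pengiriman lambat dan barang rusak"]))  # expect [0]
```

On a real held-out split, the headline metrics would come from `sklearn.metrics.accuracy_score` and `f1_score(..., average="macro")`.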

Core claim

On the Tokopedia Product Reviews 2025 dataset for three-class sentiment analysis, the Linear SVC model with TF-IDF features significantly outperformed the fine-tuned IndoBERT model, reaching 97.60 percent accuracy and a macro F1-score of 0.5510 compared with IndoBERT's 88.70 percent accuracy and 0.5088 macro F1-score. The performance gap arose primarily because the traditional baselines trained on the full corpus while the transformer was restricted to a sampled subset to manage computation. The work also applies balanced class weights to the baselines and a custom weighted cross-entropy loss to IndoBERT to address imbalance, then shows the pipeline running in a Gradio web application.

What carries the argument

Side-by-side evaluation of TF-IDF vectorization paired with Linear SVC against fine-tuning of the IndoBERT transformer for sequence classification, with class-weighting and custom loss to correct for imbalance in the review data.
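The abstract names the transformer's loss only as a custom weighted cross-entropy inside the IndoBERT training loop. One standard realization, assuming PyTorch and Hugging Face `transformers`, looks like the sketch below; the class counts, weighting scheme, and hyperparameters are illustrative, not the paper's.

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-base-p1", num_labels=3
)

# Inverse-frequency weights from hypothetical class counts (neg, neu, pos).
counts = torch.tensor([1200.0, 800.0, 18000.0])
weights = counts.sum() / (len(counts) * counts)
loss_fn = CrossEntropyLoss(weight=weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step on toy inputs.
batch = tokenizer(
    ["pengiriman cepat, barang sesuai", "kualitas sangat mengecewakan"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([2, 0])  # toy labels: positive, negative

model.train()
logits = model(**batch).logits
loss = loss_fn(logits, labels)  # weighted CE in place of the default loss
loss.backward()
optimizer.step()
```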

Load-bearing premise

The observed performance difference can be read as a meaningful model comparison even though the traditional approaches trained on the entire dataset while IndoBERT trained only on a sampled portion of the same data.

What would settle it

Re-train the IndoBERT model on the complete unsampled Tokopedia dataset and check whether its accuracy then reaches or exceeds the Linear SVC result of 97.60 percent.

Figures

Figures reproduced from arXiv: 2605.03439 by Ardika Satria, Luluk Muthoharoh, Martin C.T. Manullang, Nabila Zakiyah Zahra, Nasywa Nur Afifah, Salwa Farhanatussaidah.

Figure 1. Confusion matrices from the latest experiments. The baseline matrix clearly illustrates the pattern of inter-class … (caption truncated at source)
Figure 2. Confusion matrices from the latest experiments. The baseline matrix clearly illustrates the pattern of inter-class … (caption truncated at source)
Original abstract

The exponential growth of e-commerce platforms in Indonesia has generated a massive volume of user-generated product reviews. Analyzing the sentiment of these reviews is critical for measuring customer satisfaction and identifying product issues at scale. This paper benchmarks traditional Machine Learning (ML) approaches against a Transformer-based Deep Learning model for a three-class sentiment analysis task (positive, neutral, negative) on the Tokopedia Product Reviews 2025 dataset. We implemented Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction coupled with three algorithms: Logistic Regression, Linear Support Vector Machine (SVM), and Multinomial Naive Bayes as robust baselines. Subsequently, we fine-tuned the IndoBERT model (indobenchmark/indobert-base-p1) for contextual sequence classification. To computationally address the severe class imbalance inherent in e-commerce feedback, we applied balanced class weights for the baseline models and engineered a custom weighted cross-entropy loss function within the IndoBERT training loop, following the broader motivation of imbalanced-learning research. Our comprehensive evaluation using Accuracy, Macro F1-score, and Weighted F1-score revealed that the traditional Linear SVC model significantly outperformed the IndoBERT model in our experimental setup, achieving an Accuracy of 97.60% and a Macro F1-score of 0.5510, compared to IndoBERT's 88.70% and 0.5088. Detailed analysis indicates that this performance gap was primarily driven by discrepancies in the data sampling regimes, where baselines utilized the full corpus while the Transformer was constrained to a sampled subset. Finally, we demonstrate the practical viability of our pipeline by deploying the final sentiment classification model as an interactive Gradio web application.
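The deployment claim at the end of the abstract admits an equally small sketch. Here `model` is assumed to be the fitted TF-IDF + LinearSVC pipeline from the first sketch above; the label names and interface text are assumptions, not details taken from the paper.

```python
import gradio as gr

LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def classify(review: str) -> str:
    # Map one raw review string to a sentiment label via the fitted pipeline.
    return LABELS[int(model.predict([review])[0])]

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(label="Product review (Indonesian)"),
    outputs=gr.Label(),
    title="Tokopedia review sentiment",
)
demo.launch()
```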

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks TF-IDF + Logistic Regression, Linear SVM, and Multinomial Naive Bayes against fine-tuned IndoBERT (indobenchmark/indobert-base-p1) for three-class (positive/neutral/negative) sentiment classification on the Tokopedia Product Reviews 2025 dataset. Class weighting and a custom weighted cross-entropy loss are used to address imbalance. The headline result is that Linear SVC reaches 97.60% accuracy and 0.5510 Macro F1 while IndoBERT reaches 88.70% accuracy and 0.5088 Macro F1; the authors attribute the gap to baselines training on the full corpus versus a sampled subset for IndoBERT and also release a Gradio demo.

Significance. If the data-volume confound were removed, the work would usefully illustrate when linear models remain competitive on imbalanced Indonesian e-commerce text and would strengthen the case for pragmatic baselines before scaling to transformers. The explicit acknowledgment of the sampling discrepancy and the reproducible deployment are positive, but the current uncontrolled comparison prevents strong claims about relative model merit.

major comments (2)
  1. [Abstract and experimental results] The central claim that Linear SVC 'significantly outperformed' IndoBERT (abstract and results) rests on an uncontrolled variable: baselines are trained on the full Tokopedia corpus while IndoBERT uses only a sampled subset. Because accuracy and Macro F1 are known to improve with training-set size for both linear and transformer models, the reported gap cannot be attributed to architecture or loss design. No ablation that equalizes the number of training examples is presented.
  2. [Results and discussion] The manuscript notes the sampling discrepancy as the primary driver of the gap but still presents the raw numbers as a model comparison. Without a controlled experiment (or at minimum a quantitative estimate of the performance delta attributable to data volume alone), the benchmarking contribution is weakened and the conclusion that traditional models are preferable cannot be drawn.
minor comments (2)
  1. [Experimental setup] The exact sampling procedure, subset size, and stratification method used for IndoBERT are not stated with sufficient precision to allow replication of the exact training regime.
  2. [Methods] Hyper-parameter search ranges, early-stopping criteria, and the precise form of the custom weighted cross-entropy loss (including how class weights were computed) should be reported in a table or appendix for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and precise comments on our manuscript. We agree that the uncontrolled difference in training data volume is a significant limitation that prevents strong claims about relative model performance, and we will revise the paper to qualify our results more carefully.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The central claim that Linear SVC 'significantly outperformed' IndoBERT (abstract and results) rests on an uncontrolled variable: baselines are trained on the full Tokopedia corpus while IndoBERT uses only a sampled subset. Because accuracy and Macro F1 are known to improve with training-set size for both linear and transformer models, the reported gap cannot be attributed to architecture or loss design. No ablation that equalizes the number of training examples is presented.

    Authors: We agree that the reported gap cannot be attributed to model architecture or loss design alone. Although the manuscript already identifies the sampling discrepancy as the primary driver, we will revise the abstract to remove the phrasing 'significantly outperformed' when describing the models and instead report the observed metrics under their respective data regimes. We will also add an explicit statement that the results do not constitute a controlled model comparison. revision: yes

  2. Referee: [Results and discussion] The manuscript notes the sampling discrepancy as the primary driver of the gap but still presents the raw numbers as a model comparison. Without a controlled experiment (or at minimum a quantitative estimate of the performance delta attributable to data volume alone), the benchmarking contribution is weakened and the conclusion that traditional models are preferable cannot be drawn.

    Authors: We accept this assessment. In the revised manuscript we will reframe the results and discussion sections to present the numbers strictly as observed performances under unequal data conditions, remove any implication that traditional models are preferable in general, and add a dedicated limitations paragraph on the need for equalized training-set sizes in future benchmarking work. revision: yes

standing simulated objections not resolved
  • We are unable to perform a controlled ablation that equalizes the number of training examples or to supply a quantitative estimate of the performance delta due solely to data volume, because fine-tuning IndoBERT on the full corpus exceeds our available computational resources.
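Short of full-corpus fine-tuning, a cheaper symmetric control would refit the baselines on the same subset IndoBERT saw. A minimal sketch, assuming scikit-learn and that the subset size used for IndoBERT is known; if the SVC's advantage survives on equal data, the gap is not just volume.

```python
from sklearn.model_selection import train_test_split

def stratified_subset(texts, labels, subset_size, seed=42):
    # Draw a reproducible subset that preserves the class distribution,
    # matching the (assumed) size of the sample used for IndoBERT.
    sub_texts, _, sub_labels, _ = train_test_split(
        texts, labels,
        train_size=subset_size,
        stratify=labels,
        random_state=seed,
    )
    return sub_texts, sub_labels
```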

Circularity Check

0 steps flagged

No circularity: empirical metrics obtained from held-out evaluation

full rationale

The manuscript is a standard empirical benchmarking study. It trains Logistic Regression, Linear SVC, and Naive Bayes on TF-IDF features and fine-tunes IndoBERT, then reports Accuracy, Macro F1, and Weighted F1 on held-out test data. No equations, derivations, or first-principles claims are present; performance numbers are not defined in terms of themselves, nor are fitted parameters renamed as predictions. The authors explicitly attribute the observed gap to unequal training-set sizes (full corpus vs. sampled subset) rather than asserting a controlled architectural comparison. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central comparison rests on the assumption that TF-IDF features and standard cross-entropy (even when re-weighted) are sufficient representations, plus the untested premise that the sampled subset used for IndoBERT is representative of the full corpus.
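The representativeness premise is also directly checkable: compare class proportions in the sampled subset against the full corpus. A minimal sketch, with toy label lists standing in for counts the paper does not report:

```python
from collections import Counter

def label_proportions(labels):
    # Fraction of each class in a label list.
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: round(counts[k] / total, 3) for k in sorted(counts)}

full_labels = [2] * 1820 + [1] * 60 + [0] * 120  # hypothetical full corpus
subset_labels = [2] * 455 + [1] * 15 + [0] * 30  # hypothetical sample

print("full corpus:", label_proportions(full_labels))
print("subset:     ", label_proportions(subset_labels))
```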

free parameters (1)
  • class weights
    Balanced class weights applied to baselines and custom weighted cross-entropy loss for IndoBERT; exact weight values not stated in abstract.
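Since the exact values are unstated, it is worth pinning down the conventional "balanced" computation, n_samples / (n_classes * count_c). A minimal sketch with hypothetical class counts, assuming scikit-learn:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy neg/neu/pos counts; the paper does not report its class distribution.
y = np.array([0] * 120 + [1] * 60 + [2] * 1820)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1, 2]), y=y
)
print(dict(zip([0, 1, 2], np.round(weights, 3))))
# Minority classes get weights well above 1, the majority class below 1.
```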

pith-pipeline@v0.9.0 · 5637 in / 1405 out tokens · 51136 ms · 2026-05-07T16:42:32.211475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval.

  2. [2] Improving bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), 2019.

  3. [3] Text categorization with support vector machines: Learning with many relevant features. European Conference on Machine Learning, 1998.

  4. [4] Term-weighting approaches in automatic text retrieval. Information Processing & Management.

  5. [5] BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  6. [6] IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. arXiv preprint arXiv:2009.05387.

  7. [7] IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. arXiv preprint arXiv:2011.00677.

  8. [8] Deep learning-based text classification: A comprehensive review. ACM Computing Surveys.

  9. [9] Tokopedia Products Dataset (2025).

  10. [10] Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild. arXiv preprint arXiv:1906.02569.

  11. [11] Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter.