Benchmarking Logistic Regression, SVM, Naive Bayes, and IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian Product Reviews
Pith reviewed 2026-05-07 16:42 UTC · model grok-4.3
The pith
A linear support vector machine outperformed fine-tuned IndoBERT on Indonesian product-review sentiment, largely because it was trained on the full dataset while the transformer saw only a sample.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the Tokopedia Product Reviews 2025 dataset for three-class sentiment analysis, the Linear SVC model with TF-IDF features significantly outperformed the fine-tuned IndoBERT model, reaching 97.60 percent accuracy and a macro F1-score of 0.5510 compared with IndoBERT's 88.70 percent accuracy and 0.5088 macro F1-score. The performance gap arose primarily because the traditional baselines trained on the full corpus while the transformer was restricted to a sampled subset to manage computation. The work also applies balanced class weights to the baselines and a custom weighted cross-entropy loss to IndoBERT to address imbalance, then shows the pipeline running in a Gradio web application.
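The metric pairing in the headline numbers (accuracy near 98 percent against a macro F1 near 0.55) is itself a signature of class imbalance: a model can be nearly perfect on a dominant class while failing the minority ones. A minimal illustration with assumed label counts, not the paper's data:

```python
# Toy illustration (assumed counts, not the paper's data): a classifier
# that is near-perfect on a dominant 'pos' class but weak on the minority
# classes shows high accuracy alongside a low macro F1.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pos"] * 950 + ["neg"] * 40 + ["neu"] * 10
# Correct on all 950 'pos'; only 10 of 40 'neg' and 2 of 10 'neu' correct.
y_pred = ["pos"] * 950 + ["neg"] * 10 + ["pos"] * 30 + ["neu"] * 2 + ["pos"] * 8

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.3f}  macro F1={macro:.3f}")
```

With these assumed counts, accuracy comes out near 0.96 while macro F1 stays near 0.57, mirroring the shape of the reported results.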
What carries the argument
Side-by-side evaluation of TF-IDF vectorization paired with Linear SVC against fine-tuning of the IndoBERT transformer for sequence classification, with class-weighting and custom loss to correct for imbalance in the review data.
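The baseline side of that comparison can be sketched in a few lines of scikit-learn; the toy Indonesian reviews and vectorizer settings below are illustrative assumptions, not details taken from the paper:

```python
# Minimal sketch of the baseline pipeline: TF-IDF features feeding a
# linear SVM with balanced class weights. Toy data, illustrative settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "barang bagus, pengiriman cepat",  # positive
    "produk rusak, sangat kecewa",     # negative
    "sesuai deskripsi",                # neutral
    "kualitas mantap, recommended",    # positive
]
labels = ["positive", "negative", "neutral", "positive"]

# class_weight='balanced' reweights each class by the inverse of its
# frequency, the correction the paper applies to its baselines.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LinearSVC(class_weight="balanced"),
)
model.fit(texts, labels)
print(model.predict(["pengiriman cepat dan barang bagus"]))
```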
Load-bearing premise
The observed performance difference can be read as a meaningful model comparison even though the traditional approaches trained on the entire dataset while IndoBERT trained only on a sampled portion of the same data.
What would settle it
Re-train the IndoBERT model on the complete unsampled Tokopedia dataset and check whether its accuracy then reaches or exceeds the Linear SVC result of 97.60 percent.
Original abstract
The exponential growth of e-commerce platforms in Indonesia has generated a massive volume of user-generated product reviews. Analyzing the sentiment of these reviews is critical for measuring customer satisfaction and identifying product issues at scale. This paper benchmarks traditional Machine Learning (ML) approaches against a Transformer-based Deep Learning model for a three-class sentiment analysis task (positive, neutral, negative) on the Tokopedia Product Reviews 2025 dataset. We implemented Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction coupled with three algorithms: Logistic Regression, Linear Support Vector Machine (SVM), and Multinomial Naive Bayes as robust baselines. Subsequently, we fine-tuned the IndoBERT model (indobenchmark/indobert-base-p1) for contextual sequence classification. To computationally address the severe class imbalance inherent in e-commerce feedback, we applied balanced class weights for the baseline models and engineered a custom weighted cross-entropy loss function within the IndoBERT training loop, following the broader motivation of imbalanced-learning research. Our comprehensive evaluation using Accuracy, Macro F1-score, and Weighted F1-score revealed that the traditional Linear SVC model significantly outperformed the IndoBERT model in our experimental setup, achieving an Accuracy of 97.60% and a Macro F1-score of 0.5510, compared to IndoBERT's 88.70% and 0.5088. Detailed analysis indicates that this performance gap was primarily driven by discrepancies in the data sampling regimes, where baselines utilized the full corpus while the Transformer was constrained to a sampled subset. Finally, we demonstrate the practical viability of our pipeline by deploying the final sentiment classification model as an interactive Gradio web application.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks TF-IDF + Logistic Regression, Linear SVM, and Multinomial Naive Bayes against fine-tuned IndoBERT (indobenchmark/indobert-base-p1) for three-class (positive/neutral/negative) sentiment classification on the Tokopedia Product Reviews 2025 dataset. Class weighting and a custom weighted cross-entropy loss are used to address imbalance. The headline result is that Linear SVC reaches 97.60% accuracy and 0.5510 Macro F1 while IndoBERT reaches 88.70% accuracy and 0.5088 Macro F1; the authors attribute the gap to baselines training on the full corpus versus a sampled subset for IndoBERT and also release a Gradio demo.
Significance. If the data-volume confound were removed, the work would usefully illustrate when linear models remain competitive on imbalanced Indonesian e-commerce text and would strengthen the case for pragmatic baselines before scaling to transformers. The explicit acknowledgment of the sampling discrepancy and the reproducible deployment are positive, but the current uncontrolled comparison prevents strong claims about relative model merit.
major comments (2)
- [Abstract and experimental results] The central claim that Linear SVC 'significantly outperformed' IndoBERT (abstract and results) rests on an uncontrolled variable: baselines are trained on the full Tokopedia corpus while IndoBERT uses only a sampled subset. Because accuracy and Macro F1 are known to improve with training-set size for both linear and transformer models, the reported gap cannot be attributed to architecture or loss design. No ablation that equalizes the number of training examples is presented.
- [Results and discussion] The manuscript notes the sampling discrepancy as the primary driver of the gap but still presents the raw numbers as a model comparison. Without a controlled experiment (or at minimum a quantitative estimate of the performance delta attributable to data volume alone), the benchmarking contribution is weakened and the conclusion that traditional models are preferable cannot be drawn.
minor comments (2)
- [Experimental setup] The exact sampling procedure, subset size, and stratification method used for IndoBERT are not stated with sufficient precision to allow replication of the exact training regime.
- [Methods] Hyper-parameter search ranges, early-stopping criteria, and the precise form of the custom weighted cross-entropy loss (including how class weights were computed) should be reported in a table or appendix for reproducibility.
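On the last point, a weighted cross-entropy loss inside a PyTorch fine-tuning loop typically takes the following shape; the class counts and inverse-frequency weighting here are illustrative assumptions, since the paper does not state how its weights were computed:

```python
# One plausible form of the custom weighted cross-entropy loss.
# Class counts are illustrative, not taken from the paper.
import torch
import torch.nn as nn

# Assumed counts for (negative, neutral, positive); rare classes such
# as 'neutral' receive proportionally larger weights.
class_counts = torch.tensor([700.0, 100.0, 9200.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)

# Inside a fine-tuning loop the model's logits replace this dummy tensor.
logits = torch.randn(8, 3)           # batch of 8, 3 classes
targets = torch.randint(0, 3, (8,))
loss = loss_fn(logits, targets)
print(loss.item())
```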
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments on our manuscript. We agree that the uncontrolled difference in training data volume is a significant limitation that prevents strong claims about relative model performance, and we will revise the paper to qualify our results more carefully.
Point-by-point responses
Referee: [Abstract and experimental results] The central claim that Linear SVC 'significantly outperformed' IndoBERT (abstract and results) rests on an uncontrolled variable: baselines are trained on the full Tokopedia corpus while IndoBERT uses only a sampled subset. Because accuracy and Macro F1 are known to improve with training-set size for both linear and transformer models, the reported gap cannot be attributed to architecture or loss design. No ablation that equalizes the number of training examples is presented.
Authors: We agree that the reported gap cannot be attributed to model architecture or loss design alone. Although the manuscript already identifies the sampling discrepancy as the primary driver, we will revise the abstract to remove the phrasing 'significantly outperformed' when describing the models and instead report the observed metrics under their respective data regimes. We will also add an explicit statement that the results do not constitute a controlled model comparison. revision: yes
Referee: [Results and discussion] The manuscript notes the sampling discrepancy as the primary driver of the gap but still presents the raw numbers as a model comparison. Without a controlled experiment (or at minimum a quantitative estimate of the performance delta attributable to data volume alone), the benchmarking contribution is weakened and the conclusion that traditional models are preferable cannot be drawn.
Authors: We accept this assessment. In the revised manuscript we will reframe the results and discussion sections to present the numbers strictly as observed performances under unequal data conditions, remove any implication that traditional models are preferable in general, and add a dedicated limitations paragraph on the need for equalized training-set sizes in future benchmarking work. revision: yes
- We are unable to perform a controlled ablation that equalizes the number of training examples or to supply a quantitative estimate of the performance delta due solely to data volume, because fine-tuning IndoBERT on the full corpus exceeds our available computational resources.
Circularity Check
No circularity: empirical metrics obtained from held-out evaluation
Full rationale
The manuscript is a standard empirical benchmarking study. It trains Logistic Regression, Linear SVC, and Naive Bayes on TF-IDF features, fine-tunes IndoBERT, and reports Accuracy, Macro F1, and Weighted F1 on held-out test data. No equations, derivations, or first-principles claims are present; performance numbers are not defined in terms of themselves, nor are fitted parameters renamed as predictions. The authors explicitly attribute the observed gap to unequal training-set sizes (full corpus vs. sampled subset) rather than asserting a controlled architectural comparison. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- class weights
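The "balanced" scheme behind this single free parameter assigns each class the weight n_samples / (n_classes * count_c). A sketch with assumed class counts, not the paper's distribution:

```python
# Sketch of the 'balanced' weighting scheme listed as a free parameter:
# weight_c = n_samples / (n_classes * count_c), scikit-learn's formula.
# The class counts below are assumed for illustration.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(["pos"] * 9200 + ["neg"] * 700 + ["neu"] * 100)
classes = np.array(["neg", "neu", "pos"])
w = compute_class_weight("balanced", classes=classes, y=y)

for c, wt in zip(classes, w):
    print(f"{c}: {wt:.3f}")  # minority classes get the largest weights
```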
Reference graph
Works this paper leans on
- [1] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008.
- [2] Improving bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), 2019.
- [3] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. European Conference on Machine Learning (ECML), 1998.
- [4] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988.
- [5] J. Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [6] B. Wilie et al. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. arXiv preprint arXiv:2009.05387, 2020.
- [7] F. Koto et al. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. arXiv preprint arXiv:2011.00677, 2020.
- [8] S. Minaee et al. Deep learning-based text classification: A comprehensive review. ACM Computing Surveys, 2021.
- [9] Tokopedia Products Dataset (2025). 2025.
- [10] A. Abid et al. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569, 2019.
- [11] N. V. Chawla, N. Japkowicz, and A. Kolcz. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 2004.