arXiv: 2605.07793 · v1 · submitted 2026-05-08 · 💻 cs.CL

Hybrid TF-IDF Logistic Regression and MLP Neural Baseline for Indonesian Three-Class Sentiment Analysis on Social Media Text

Pith reviewed 2026-05-11 02:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords Indonesian sentiment analysis · TF-IDF · logistic regression · MLP · social media · baseline model · three-class classification

The pith

For small Indonesian social media datasets, a TF-IDF logistic regression model delivers 80 percent accuracy on three-class sentiment analysis and serves as a practical production baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a compact baseline for three-class sentiment analysis on Indonesian social media text by remapping fine-grained emotion labels to positive, negative, and neutral. It shows that combining TF-IDF text features with three simple numeric metadata features and a balanced multinomial logistic regression classifier reaches about 80 percent accuracy on a cleaned dataset of 707 samples. This matters because it suggests that interpretable, lightweight models can stay competitive on small, imbalanced datasets without complex neural architectures in production settings. The work also compares this model to an MLP neural baseline that reportedly reaches higher accuracy but is not recommended for deployment.

Core claim

The central claim is that careful preprocessing, interpretable feature engineering with TF-IDF and metadata features, and class balancing allow a logistic regression model to reach 0.8028 accuracy, 0.8003 weighted F1, and 0.7276 macro F1 on the three-class task, making it suitable for production while the higher-accuracy MLP is treated as a comparative experiment only.
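The gap between the reported weighted F1 (0.8003) and macro F1 (0.7276) is what one would expect when a small class is classified less well than a large one. The sketch below shows how the three metrics are computed from a confusion matrix. The matrix is hypothetical and purely illustrative; it is not the paper's confusion matrix, and the function names are made up for this example.

```python
# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted).
# These counts are illustrative only; they are NOT the paper's results.
LABELS = ["positive", "negative", "neutral"]
CONFUSION = [
    [80, 8, 2],   # true positive-sentiment samples
    [6, 30, 2],   # true negative-sentiment samples
    [4, 2, 6],    # true neutral samples
]


def per_class_f1(cm):
    """Return a (f1, support) pair for each class of a square confusion matrix."""
    results = []
    for i in range(len(cm)):
        tp = cm[i][i]
        support = sum(cm[i])                    # true members of class i
        predicted = sum(row[i] for row in cm)   # samples predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((f1, support))
    return results


def summarize(cm):
    """Return (accuracy, macro F1, weighted F1) for a confusion matrix."""
    scores = per_class_f1(cm)
    total = sum(support for _, support in scores)
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    # Macro F1: every class counts equally, however small it is.
    macro_f1 = sum(f1 for f1, _ in scores) / len(scores)
    # Weighted F1: each class counts in proportion to its support.
    weighted_f1 = sum(f1 * support for f1, support in scores) / total
    return accuracy, macro_f1, weighted_f1


accuracy, macro_f1, weighted_f1 = summarize(CONFUSION)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

In this toy matrix the neutral class has the lowest per-class F1, so macro F1 falls below weighted F1, which is the same direction as the gap in the reported numbers.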

What carries the argument

The hybrid feature set of TF-IDF vectors plus three lightweight numeric metadata features fed into a balanced multinomial logistic regression classifier, compared against a two-layer MLP on the same features.
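As a rough illustration of what a hybrid feature row looks like, the sketch below builds L2-normalized TF-IDF vectors in plain Python and appends three numeric metadata columns. Several things here are assumptions: the paper's three metadata features are not named on this page, so character count, token count, and exclamation-mark count are hypothetical stand-ins; the smoothed IDF formula is one common variant; and the toy posts are invented.

```python
import math
from collections import Counter

# Toy Indonesian-style posts; illustrative only, not drawn from the paper's dataset.
POSTS = [
    "senang banget hari ini!",
    "kecewa sama pelayanannya",
    "biasa saja sih",
]


def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a simplifying assumption).
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()


def fit_idf(docs):
    """Fit a sorted vocabulary and smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


def metadata_features(text):
    # Hypothetical stand-ins for the paper's three numeric metadata features.
    return [float(len(text)), float(len(text.split())), float(text.count("!"))]


tokenized = [tokenize(post) for post in POSTS]
vocab, idf = fit_idf(tokenized)
# Hybrid row = TF-IDF columns followed by the three metadata columns.
hybrid = [
    tfidf_vector(doc, vocab, idf) + metadata_features(text)
    for doc, text in zip(tokenized, POSTS)
]
print(len(vocab), len(hybrid[0]))
```

The metadata columns are left unscaled here; in practice they would probably need scaling before being fed to a linear classifier, since their range differs from that of the normalized TF-IDF columns.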

If this is right

  • Careful preprocessing and class balancing improve performance on imbalanced sentiment data.
  • Interpretable models like logistic regression remain competitive with neural networks for small datasets.
  • Neural baselines like MLP should be used for comparison rather than direct deployment in resource-constrained settings.
  • Remapping fine-grained emotions to sentiment classes enables effective three-class analysis even with limited samples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hybrid approaches could apply to other low-resource languages with limited labeled social media data.
  • Adding more metadata features might improve the logistic regression without added complexity.
  • Testing the remapped labels against direct three-class annotations would confirm reliability.

Load-bearing premise

That remapping the original 191 fine-grained emotion labels to three sentiment classes produces reliable positive, negative, and neutral targets without substantial label noise or loss of meaning in the final 707 samples.
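To make the premise concrete, a remapping step of this kind can be sketched as a lookup table plus a check for labels the table does not cover. The mapping below is hypothetical: the five emotion names and their assigned classes are invented for illustration, and the paper's actual 191-label mapping is not reproduced on this page.

```python
from collections import Counter

# Hypothetical valence lookup; the paper's actual 191-label mapping is not shown here.
VALENCE_MAP = {
    "joy": "positive",
    "gratitude": "positive",
    "anger": "negative",
    "sadness": "negative",
    "surprise": "neutral",
}


def remap(label, mapping=VALENCE_MAP):
    """Map a fine-grained emotion label to a sentiment class, or None if unmapped."""
    return mapping.get(label.strip().lower())


def remap_rows(rows):
    """Split (text, emotion) rows into remapped rows and rows needing manual review."""
    mapped, unmapped = [], []
    for text, emotion in rows:
        sentiment = remap(emotion)
        if sentiment is None:
            unmapped.append((text, emotion))
        else:
            mapped.append((text, sentiment))
    return mapped, unmapped


rows = [
    ("teks a", "Joy"),
    ("teks b", "anger "),
    ("teks c", "nostalgia"),   # not in the lookup, so it is flagged for review
    ("teks d", "surprise"),
]
mapped, unmapped = remap_rows(rows)
distribution = Counter(sentiment for _, sentiment in mapped)
print(distribution, unmapped)
```

Surfacing unmapped labels instead of silently defaulting them to a class is one way to keep label noise visible, which matters most for a class as small as the 60-sample neutral class.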

What would settle it

Evaluating the logistic regression model on a new, independently annotated three-class Indonesian social media dataset would determine if the reported accuracy holds beyond the remapped labels.

Figures

Figures reproduced from arXiv: 2605.07793 by Allya Nurul Islami Pasha, Ardika Satria, Eka Fidiya Putri, Luluk Muthoharoh, Martin C.T. Manullang.

Figure 1. Class distribution after preprocessing. The positive class dominates the dataset, while the neutral class remains …
Figure 2. End-to-end methodology pipeline exported to …
Figure 3. Benchmark comparison exported from paper_outputs/model_benchmark_bar_chart.png. Linear SVM yields the best overall preserved performance, while Random Forest performs worst in macro F1
Figure 4. Confusion matrix of the Logistic Regression model exported in …

Original abstract

This paper presents a compact three-class sentiment analysis study for Indonesian social media text. The task is formulated with positive, negative, and neutral outputs derived from a fine-grained emotion dataset. The proposed practical baseline combines TF-IDF text features, three lightweight numeric metadata features, and a balanced multinomial Logistic Regression classifier. For comparison, the study also includes a neural baseline using a two-layer multilayer perceptron (MLP) over the same hybrid feature representation. The dataset originally contains 732 rows and 191 fine-grained emotion labels; after cleaning, deduplication, and label remapping, 707 samples remain with an imbalanced distribution of 459 positive, 188 negative, and 60 neutral instances. Experimental results show that the Logistic Regression deployment model reaches 0.8028 accuracy, 0.8003 weighted F1, and 0.7276 macro F1, while project documentation reports a higher-accuracy but non-production MLP baseline. These findings indicate that careful preprocessing, interpretable feature engineering, and class balancing remain competitive for small Indonesian sentiment datasets, whereas the neural baseline is better treated as a comparative experiment than as the default deployment model.
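The class counts in the abstract are enough to work out how skewed the data is and what a "balanced" weighting would look like. The sketch below assumes the common inverse-frequency heuristic, weight = n_samples / (n_classes × class_count), which is the formula scikit-learn applies for class_weight="balanced". The abstract does not say which scheme or library the authors used, so treat the weights as an illustration.

```python
# Class counts after cleaning, as reported in the abstract.
CLASS_COUNTS = {"positive": 459, "negative": 188, "neutral": 60}

total = sum(CLASS_COUNTS.values())
n_classes = len(CLASS_COUNTS)

# Share of each class in the cleaned dataset.
shares = {label: count / total for label, count in CLASS_COUNTS.items()}

# Assumed inverse-frequency weights: n_samples / (n_classes * class_count).
balanced_weights = {
    label: total / (n_classes * count) for label, count in CLASS_COUNTS.items()
}

# Accuracy of always predicting the majority class, on the full dataset.
majority_baseline = max(shares.values())

for label in CLASS_COUNTS:
    print(f"{label:>8}: share={shares[label]:.3f} weight={balanced_weights[label]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Under this heuristic each class contributes the same total weight, so a neutral sample counts roughly 7.6 times as much as a positive one (459 / 60 ≈ 7.65). The majority-class baseline of about 0.649 gives a reference point for reading the reported 0.8028 accuracy.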

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a compact study on three-class sentiment analysis for Indonesian social media text. It derives positive/negative/neutral labels by remapping a fine-grained emotion dataset (originally 732 rows with 191 labels, reduced to 707 samples with 459/188/60 distribution after cleaning), then trains a hybrid TF-IDF plus three numeric metadata features model using balanced multinomial Logistic Regression, reporting 0.8028 accuracy, 0.8003 weighted F1 and 0.7276 macro F1; an MLP neural baseline is included for comparison but deemed non-production.

Significance. If the label remapping is faithful, the work supplies a practical, interpretable baseline that shows careful preprocessing and class balancing can remain competitive for small imbalanced Indonesian datasets, offering a reproducible alternative to neural models for deployment scenarios where transparency matters.

major comments (2)
  1. §2 (Dataset preparation): the remapping of the original 191 fine-grained emotion labels to the three sentiment classes is described only at summary level with no explicit mapping table, decision criteria, or validation steps (e.g., spot-checks or inter-rater agreement). Given the tiny neutral class (60 samples), even modest label noise would directly undermine the reported macro F1 of 0.7276 and the claim that the LR model is a reliable baseline.
  2. §3 (Experimental setup): the manuscript supplies no information on the train-test split ratio, cross-validation procedure, exact class-balancing method (weights or sampling), or feature-selection steps. These omissions leave the headline metrics (0.8028 accuracy, 0.8003 weighted F1) only partially supported and difficult to reproduce or compare.
minor comments (1)
  1. Abstract: the MLP baseline is said to be 'higher-accuracy' yet no numeric values are given, only a reference to 'project documentation'; this should be stated explicitly or removed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important gaps in documentation. We address each major point below and have revised the manuscript to improve reproducibility and transparency.

Point-by-point responses
  1. Referee: §2 (Dataset preparation): the remapping of the original 191 fine-grained emotion labels to the three sentiment classes is described only at summary level with no explicit mapping table, decision criteria, or validation steps (e.g., spot-checks or inter-rater agreement). Given the tiny neutral class (60 samples), even modest label noise would directly undermine the reported macro F1 of 0.7276 and the claim that the LR model is a reliable baseline.

    Authors: We agree that the label remapping was described at too high a level. In the revised manuscript we will insert an explicit table mapping all 191 original emotion labels to the three sentiment classes, together with the decision criteria (valence-based grouping) and any validation performed (manual spot-checks on a subset). We also acknowledge the small neutral class (60 samples) and will add an explicit discussion of how label noise in this class could affect macro F1 and the reliability of the baseline.
    Revision: yes

  2. Referee: §3 (Experimental setup): the manuscript supplies no information on the train-test split ratio, cross-validation procedure, exact class-balancing method (weights or sampling), or feature-selection steps. These omissions leave the headline metrics (0.8028 accuracy, 0.8003 weighted F1) only partially supported and difficult to reproduce or compare.

    Authors: We accept that these procedural details were omitted. The revised version will state the train-test split (80:20 stratified), note that cross-validation was not used because of the small dataset size, specify the class-balancing method (inverse-frequency class weights inside the multinomial logistic regression), and describe feature handling (all TF-IDF terms retained after preprocessing plus the three metadata features with no further selection). These additions will make the reported metrics fully reproducible.
    Revision: yes
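One quick plausibility check on the 80:20 split mentioned in the simulated rebuttal is to see whether the reported accuracy corresponds to a whole number of correct predictions on a test set of that size. The sketch below assumes the test size is rounded up (as scikit-learn's train_test_split does for a fractional test_size). Because the rebuttal is simulated, the split ratio itself is an assumption and not a confirmed detail of the paper.

```python
import math

N_SAMPLES = 707            # cleaned dataset size reported in the abstract
TEST_FRACTION = 0.20       # 80:20 split as stated in the simulated rebuttal (assumption)
REPORTED_ACCURACY = 0.8028

# Assumption: the test-set size is rounded up from the fractional value.
n_test = math.ceil(TEST_FRACTION * N_SAMPLES)
n_train = N_SAMPLES - n_test

# Counts of correct predictions whose accuracy rounds to the reported four decimals.
consistent = [
    k for k in range(n_test + 1)
    if abs(k / n_test - REPORTED_ACCURACY) < 5e-5
]

print(f"train={n_train} test={n_test} consistent correct counts={consistent}")
```

Under these assumptions the reported 0.8028 matches exactly one count, 114 correct out of 142, so the headline accuracy is at least arithmetically consistent with a single 80:20 hold-out split. With only 142 test samples, each additional error moves accuracy by about 0.7 percentage points.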

Circularity Check

0 steps flagged

No significant circularity; standard ML evaluation pipeline with independent metrics

The paper follows a conventional supervised learning workflow: start with the raw dataset (732 rows, 191 emotion labels), apply cleaning, deduplication, and remapping to obtain 707 samples with three-class targets, extract hybrid TF-IDF and metadata features, train LR and MLP classifiers, and report hold-out or cross-validated accuracy and F1 scores. No equations, fitted parameters, or self-citations reduce the final metrics (0.8028 accuracy, 0.7276 macro F1) to the inputs by construction. The label-remapping step is a preprocessing assumption whose validity can be externally checked but does not create definitional or predictive circularity. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of label remapping from fine-grained emotions and on standard supervised-learning assumptions about the cleaned dataset.

free parameters (1)
  • class balancing weights
    Class balancing is invoked to address the 459/188/60 imbalance; the specific weights or sampling ratios are not stated and therefore function as free parameters.
axioms (1)
  • domain assumption: Remapping 191 fine-grained emotion labels to three sentiment classes preserves semantic validity
    The task formulation explicitly derives positive/negative/neutral targets from the original emotion dataset.

pith-pipeline@v0.9.0 · 5534 in / 1326 out tokens · 56872 ms · 2026-05-11T02:58:13.298198+00:00 · methodology
