
arxiv: 2604.24720 · v1 · submitted 2026-04-27 · 💻 cs.CL

Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Pith reviewed 2026-05-08 03:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords: sentiment classification, emotion classification, Indonesian language, BiLSTM, multi-task learning, e-commerce reviews, AutoML, text preprocessing

The pith

A multi-task BiLSTM with a shared encoder classifies both sentiment and emotion from Indonesian e-commerce reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a two-track system to label Indonesian marketplace reviews for binary sentiment and one of five emotions, addressing the mix of standard words, slang, loanwords, and emojis that defeat simpler lexicon tools. One track runs TF-IDF features through an AutoML sweep of ordinary classifiers, while the other trains a bidirectional LSTM that shares one encoder across the two tasks but routes to separate output heads. The authors apply fourteen cleaning steps, including a custom slang dictionary, then benchmark several BiLSTM sizes plus a TextCNN model on 5,400 labeled reviews drawn from twenty-nine product categories. If the shared-encoder design works well, platforms could run both analyses from a single model rather than maintaining separate pipelines.

Core claim

The paper claims that a PyTorch BiLSTM network with a shared bidirectional encoder and two task-specific heads can jointly classify binary sentiment and five-class emotion on cleaned Indonesian review text. It further claims that this multi-task setup, trained with class-weighted loss and early stopping, is a practical alternative to TF-IDF plus AutoML, with the language's informal features handled by fourteen sequential preprocessing steps.

What carries the argument

A multi-task BiLSTM in which a shared bidirectional LSTM encoder feeds two separate output heads, one for sentiment and one for emotion, so that features are learned jointly while each head specializes on its own task.
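The shared-encoder design can be illustrated with a minimal PyTorch sketch. The layer sizes, the vocabulary size, the use of the final hidden states as the review representation, and the equal-weight sum of the two losses are all assumptions made for illustration; the paper's actual configurations (Baseline, Improved, Large) are not specified here.

```python
import torch
import torch.nn as nn


class MultiTaskBiLSTM(nn.Module):
    """Shared BiLSTM encoder with separate sentiment and emotion heads."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=32,
                 n_sentiment=2, n_emotion=5, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Both heads read the same encoder output (hypothetical sizes).
        self.sentiment_head = nn.Linear(2 * hidden_dim, n_sentiment)
        self.emotion_head = nn.Linear(2 * hidden_dim, n_emotion)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)           # (batch, seq, embed_dim)
        _, (h_n, _) = self.encoder(embedded)           # h_n: (2, batch, hidden_dim)
        features = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2 * hidden_dim)
        return self.sentiment_head(features), self.emotion_head(features)


model = MultiTaskBiLSTM(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 12))  # 4 dummy reviews, 12 token ids each
sent_logits, emo_logits = model(batch)

# Joint objective: an unweighted sum of the two task losses (an assumption).
criterion = nn.CrossEntropyLoss()
sent_y = torch.tensor([0, 1, 1, 0])
emo_y = torch.tensor([0, 3, 4, 2])
loss = criterion(sent_logits, sent_y) + criterion(emo_logits, emo_y)
loss.backward()
```

Because one backward pass sends gradients from both heads into the same encoder, the encoder weights are shaped by both tasks, which is the mechanism the summary above describes.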

If this is right

  • The shared encoder captures text features that serve both sentiment and emotion tasks without needing duplicate processing.
  • The slang dictionary and fourteen-step cleaning reduce the effect of regional and informal terms on final accuracy.
  • Class-weighted cross-entropy loss helps the model cope with the uneven distribution across emotion categories, while ReduceLROnPlateau scheduling and early stopping guard against unstable or overlong training.
  • Benchmarking baseline, improved, and large BiLSTM variants plus TextCNN identifies practical model-size trade-offs for this data.
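To make the class-weighting idea concrete, here is a small pure-Python sketch. The inverse-frequency weighting formula and the toy label counts are assumptions for illustration; the paper does not state how its weights are derived. In PyTorch, weights like these would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the sum of target weights."""
    total = sum(weights[t] * -math.log(p[t]) for p, t in zip(probs, targets))
    return total / sum(weights[t] for t in targets)


# Hypothetical, deliberately imbalanced toy labels (not PRDECT-ID counts).
labels = ["Happy"] * 6 + ["Sad"] * 3 + ["Fear"] * 1
weights = balanced_class_weights(labels)

# Two toy predictions: one easy majority-class example, one hard rare-class one.
probs = [{"Happy": 0.8, "Sad": 0.1, "Fear": 0.1},
         {"Happy": 0.6, "Sad": 0.2, "Fear": 0.2}]
targets = ["Happy", "Fear"]
loss = weighted_cross_entropy(probs, targets, weights)
```

The rare class receives the largest weight, so an error on a "Fear" review contributes more to the loss than an equally confident error on a "Happy" review.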

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-encoder pattern could transfer to other paired classification problems such as topic plus sentiment in other informal language settings.
  • Adding explicit emoji tokenization might raise emotion-class accuracy on reviews that rely on emoticons for tone.
  • Real-time deployment on live review streams could let platforms flag negative or angry feedback without separate sentiment and emotion services.

Load-bearing premise

The dataset labels accurately reflect the true sentiment and emotions in the reviews, and the fourteen preprocessing steps together with the slang dictionary sufficiently normalize the informal language variations.
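A few of the cleaning steps can be sketched as below. This is not the paper's fourteen-step pipeline: the five steps, their order, and the dictionary entries are illustrative assumptions, and the paper's 140-entry slang dictionary is not reproduced. Note that step 4 in this sketch also strips emoji, which a real pipeline may prefer to keep.

```python
import re

# Illustrative entries only; not taken from the paper's slang dictionary.
SLANG = {"gk": "tidak", "ga": "tidak", "bgt": "banget", "brg": "barang", "yg": "yang"}


def normalize(text, slang=SLANG):
    """Apply a handful of sequential cleaning steps to one review."""
    text = text.lower()                                     # 1. case folding
    text = re.sub(r"https?://\S+", " ", text)               # 2. drop URLs
    text = re.sub(r"(.)\1{2,}", r"\1", text)                # 3. collapse 3+ repeated characters
    text = re.sub(r"[^\w\s]", " ", text)                    # 4. strip punctuation and symbols
    tokens = [slang.get(tok, tok) for tok in text.split()]  # 5. slang lookup
    return " ".join(tokens)


cleaned = normalize("Brg bagusss bgt, gk nyesel!!!")
print(cleaned)  # barang bagus banget tidak nyesel
```

Any token missing from the dictionary passes through unchanged, which is why dictionary coverage matters for the premise stated above.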

What would settle it

Applying the trained model to a new collection of Indonesian e-commerce reviews gathered from a different platform, then measuring agreement between its predictions and independent human labels on the same texts, would directly test whether the classifications remain reliable.
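One way to quantify such agreement is Cohen's kappa, which corrects raw agreement for chance. The choice of kappa and the toy labels below are assumptions; the paper does not prescribe a specific agreement statistic.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical model predictions and independent human labels.
model_pred = ["Happy", "Happy", "Anger", "Sad", "Love", "Anger"]
human_gold = ["Happy", "Love", "Anger", "Sad", "Love", "Sad"]
kappa = cohen_kappa(model_pred, human_gold)
```

A kappa near 1 would indicate the classifications hold up on the new platform; a value near 0 would indicate agreement no better than chance.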

Figures

Figures reproduced from arXiv: 2604.24720 by Ahmad Rizqi, Hermawan Manurung, Ibrahim Al-Kahfi, Martin Clinton Tosima Manullang.

Figure 1. Label distribution for sentiment (left, binary) and emotion (right, five-class). The 19.8 pp gap between Happy …
Figure 2. Review counts across 29 product categories. The narrow range per category confirms the stratified design of …
Figure 3. AutoML evaluation leaderboard for sentiment classification. The best-performing classical baseline is selected …
Figure 4. AutoML comparison for five-class emotion classification. Macro-F1 is used as the primary ranking metric
Figure 5. Normalized confusion matrix for five-class emotion classification on the 539-review test split.
Original abstract

Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at https://github.com/ikii-sd/pba2026-crazyrichteam.
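The TF-IDF features behind the first track can be illustrated with a hand-rolled, standard-library sketch. This is a stand-in for exposition only: it is not PyCaret, the smoothed IDF variant and L2 normalisation are assumptions, and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many reviews each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        raw = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Toy, already-cleaned reviews (hypothetical, not taken from PRDECT-ID).
reviews = ["barang bagus banget",
           "barang rusak kecewa",
           "pengiriman cepat barang bagus"]
vectors = tfidf(reviews)
```

Terms that occur in every review, such as "barang" here, get the lowest weight, while terms that distinguish a review, such as "rusak", get more. Vectors of this kind are what a sweep over standard classifiers would consume.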

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper describes a two-track classification pipeline for binary sentiment and five-class emotion labeling on the PRDECT-ID dataset of 5,400 Indonesian e-commerce reviews. Track one applies TF-IDF vectorization followed by a PyCaret AutoML sweep over standard classifiers. Track two implements a PyTorch multi-task BiLSTM with a shared encoder and task-specific heads, in three variants (Baseline, Improved, Large), alongside a TextCNN comparator. A 14-step preprocessing pipeline with a 140-entry slang dictionary is used; training employs class-weighted cross-entropy, ReduceLROnPlateau, and early stopping. Models are deployed as Gradio apps with public code.

Significance. If the unreported benchmarks were to show competitive or improved performance, the work would supply a practical, reproducible pipeline for noisy Indonesian marketplace text and lower the barrier for e-commerce sentiment applications in a low-resource language. The public GitHub repository and Hugging Face deployment are concrete strengths that aid reproducibility.

major comments (2)
  1. [Experimental Results] Experimental Results section: the manuscript states that four configurations are benchmarked yet reports no accuracy, F1, or other metrics, no ablation tables, and no comparison between the AutoML track and the BiLSTM variants. Without these numbers the central benchmarking claim cannot be evaluated.
  2. [Dataset and Preprocessing] Dataset and Preprocessing section: the pipeline assumes the PRDECT-ID labels accurately reflect true sentiment and emotion and that the 14-step cleaning plus slang dictionary sufficiently normalize marketplace Indonesian, but no inter-annotator agreement, label-error analysis, or ablation on the slang dictionary is provided. This assumption is load-bearing for any performance claims.
minor comments (3)
  1. [Abstract] Abstract: mentions benchmarking four configurations but supplies no headline results or key findings, which is atypical for an abstract and reduces its utility.
  2. [Methods] Methods: the 14 sequential preprocessing steps are enumerated in prose; a compact table or pseudocode listing would improve clarity and reproducibility.
  3. [Related Work] Related Work: additional citations to prior multi-task BiLSTM or Indonesian sentiment papers would better situate the contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and commit to revisions that strengthen the reporting of results and the discussion of dataset assumptions.

  1. Referee: [Experimental Results] Experimental Results section: the manuscript states that four configurations are benchmarked yet reports no accuracy, F1, or other metrics, no ablation tables, and no comparison between the AutoML track and the BiLSTM variants. Without these numbers the central benchmarking claim cannot be evaluated.

    Authors: We agree that the Experimental Results section as currently written does not include the required numerical metrics or comparisons. This omission prevents evaluation of the central claims. In the revised manuscript we will add full performance tables reporting accuracy, macro-F1, precision, and recall for the AutoML track and all BiLSTM variants (Baseline, Improved, Large) plus the TextCNN comparator. We will also include ablation tables and explicit side-by-side comparisons between the two tracks.

    revision: yes

  2. Referee: [Dataset and Preprocessing] Dataset and Preprocessing section: the pipeline assumes the PRDECT-ID labels accurately reflect true sentiment and emotion and that the 14-step cleaning plus slang dictionary sufficiently normalize marketplace Indonesian, but no inter-annotator agreement, label-error analysis, or ablation on the slang dictionary is provided. This assumption is load-bearing for any performance claims.

    Authors: The PRDECT-ID dataset is a publicly released resource whose labels we adopt as ground truth, which is standard for benchmarking studies on this corpus. We will add a limitations paragraph noting that the original dataset release does not report inter-annotator agreement and discussing possible label noise. We will also include (1) an ablation experiment comparing model performance with and without the slang dictionary and (2) a qualitative label-error analysis of misclassified examples. These additions will make the preprocessing assumptions more transparent and testable.

    revision: partial
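The metrics promised in the first response are simple to compute. The sketch below, with hypothetical labels, shows why macro-F1 and accuracy should be reported together on imbalanced emotion classes: a model that always predicts the majority class can look acceptable on accuracy alone.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either sequence."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced test labels and a majority-class-only predictor.
y_true = ["Happy"] * 8 + ["Fear"] * 2
y_pred = ["Happy"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is 0.8 while macro-F1 is well under 0.5, because the "Fear" class contributes an F1 of zero.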

Circularity Check

0 steps flagged

No significant circularity detected


The paper applies standard supervised learning pipelines (TF-IDF + PyCaret AutoML sweep and a multi-task BiLSTM with shared encoder) to an externally provided labeled dataset (PRDECT-ID). No equations, fitted parameters, or results are redefined as predictions; no self-citations are invoked as load-bearing uniqueness theorems; and no ansatzes or renamings reduce the central claims to the inputs by construction. The derivation chain consists of conventional preprocessing, training, and benchmarking steps that remain independent of the target outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on standard supervised learning assumptions and on the quality of the external PRDECT-ID labels. It introduces no new free parameters, invented entities, or ad-hoc axioms beyond domain-standard preprocessing choices.

axioms (1)
  • domain assumption: The PRDECT-ID dataset labels accurately reflect sentiment and emotion categories.
    Training and evaluation of both tracks depend directly on these labels.


