Enhancing Game Review Sentiment Classification on Steam Platform with Attention-Based BiLSTM

Abit Ahmad Oktarian; Ardika Satria; Dhafin Razaqa Luthfi; Fadhil Fitra Wijaya; Luluk Muthoharoh; Martin Clinton Tosima Manullang

arXiv: 2605.01315 · v1 · submitted 2026-05-02 · cs.CL

Pith reviewed 2026-05-09 14:55 UTC · model grok-4.3

keywords: sentiment analysis · BiLSTM · attention mechanism · Steam platform · game reviews · natural language processing · class imbalance · deep learning

The pith

An attention-based BiLSTM model achieves 83% accuracy on classifying sentiments in Steam game reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a bidirectional LSTM network equipped with an attention layer can reliably determine whether Steam game reviews are positive or negative. The authors sample 50,000 reviews, train the model with a weighted loss function to correct for the fact that positive reviews dominate the data, and report 83 percent accuracy together with 85 percent weighted F1 on a held-out test set. Attention weights are visualized to show which words the model focuses on for its decisions. Compared with a baseline built from TF-IDF features and PyCaret AutoML, the neural model performs better, especially at catching negative reviews, where it reaches 90 percent recall. The goal is to give game makers an automated way to understand large amounts of player feedback.

Core claim

The authors establish that their BiLSTM model augmented with attention and trained using class-weighted cross-entropy loss attains 83% accuracy, 85% weighted F1-score, and 90% recall for negative reviews on a test set drawn from 50,000 Steam reviews, while the attention component supplies interpretable weightings over words in each review.

What carries the argument

An attention layer added on top of the BiLSTM dynamically weights the importance of each word in a review when computing the sentiment prediction.
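One common way to write such an attention-pooling layer is given below as a sketch, since this summary does not state the paper's exact formulation. The symbols are assumptions introduced for illustration: $h_t$ is the concatenated forward and backward BiLSTM hidden state at token $t$, $T$ is the review length, and $W$, $b$, $v$, $W_o$, $b_o$ are learned parameters.

```latex
\begin{aligned}
u_t &= \tanh(W h_t + b), \\
\alpha_t &= \frac{\exp(v^\top u_t)}{\sum_{k=1}^{T} \exp(v^\top u_k)}, \\
c &= \sum_{t=1}^{T} \alpha_t h_t, \\
\hat{y} &= \operatorname{softmax}(W_o c + b_o).
\end{aligned}
```

Under this form the weights $\alpha_t$ are non-negative and sum to one over the review, which is what allows them to be drawn as a per-word heatmap.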

If this is right

Developers receive an automated tool that surfaces negative feedback with high recall.
Attention maps let developers see exactly which phrases in reviews trigger negative classifications.
The weighted loss strategy mitigates the common problem of positive reviews outnumbering negative ones in online platforms.
The deep learning pipeline outperforms the TF-IDF + PyCaret AutoML baseline on this corpus.
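The weighted-loss strategy mentioned in the list can be sketched in plain Python. This is a minimal illustration, not the authors' code: the inverse-frequency weighting rule, the toy labels, and the predicted probabilities are all assumptions, because the summary does not report the class weights actually used. The loss is normalised by the summed weights, which is how PyTorch's `nn.CrossEntropyLoss` behaves when a `weight` tensor is supplied with mean reduction.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the summed weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical toy labels: 1 = positive (majority), 0 = negative (minority).
labels = [1, 1, 1, 1, 0]
weights = inverse_frequency_weights(labels)

# Hypothetical predicted probabilities, indexed by class id.
probs = [[0.2, 0.8]] * 4 + [[0.6, 0.4]]
loss = weighted_cross_entropy(probs, labels, weights)
```

With these toy labels the single negative review carries four times the weight of each positive one, so a mistake on the minority class moves the loss far more than a mistake on the majority class.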

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could be transferred to review sentiment tasks on other user-generated content platforms.
Extending the model to multi-class sentiment or aspect-based sentiment analysis would provide finer-grained insights into player opinions.
Periodic retraining on newer reviews would be needed to maintain performance as language and game trends evolve.
Integrating the attention weights with review metadata such as playtime or review length could improve overall utility.

Load-bearing premise

The sampled 50,000 reviews adequately represent the broader Steam review distribution and the model will continue to perform well on new reviews without major changes in review language or game types.

What would settle it

Evaluating the trained model on a fresh collection of several thousand Steam reviews posted after the original sampling date and observing whether accuracy falls substantially below 83 percent.
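A check of this kind could be organised roughly as follows. The sketch assumes each review record carries a posting date, and the field names, the cutoff, and the `toy_predict` stand-in for the trained classifier are all hypothetical.

```python
from datetime import date


def temporal_split(reviews, cutoff):
    """Split reviews into those posted before the cutoff and those on/after it."""
    before = [r for r in reviews if r["posted"] < cutoff]
    after = [r for r in reviews if r["posted"] >= cutoff]
    return before, after


def accuracy(predict, reviews):
    """Fraction of reviews whose predicted label matches the stored label."""
    if not reviews:
        raise ValueError("no reviews to evaluate")
    hits = sum(predict(r["text"]) == r["label"] for r in reviews)
    return hits / len(reviews)


# Hypothetical records; the field names and dates are illustrative only.
reviews = [
    {"text": "great game", "label": 1, "posted": date(2024, 1, 5)},
    {"text": "refund please", "label": 0, "posted": date(2024, 2, 9)},
    {"text": "great update", "label": 1, "posted": date(2025, 3, 1)},
    {"text": "broken launcher", "label": 0, "posted": date(2025, 4, 2)},
]


def toy_predict(text):
    """Stand-in for the trained classifier: positive iff the text contains 'great'."""
    return 1 if "great" in text else 0


seen, fresh = temporal_split(reviews, cutoff=date(2025, 1, 1))
fresh_accuracy = accuracy(toy_predict, fresh)
```

Comparing `fresh_accuracy` against the accuracy on the original held-out split would show whether performance degrades on later reviews.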

Figures

Figures reproduced from arXiv: 2605.01315 by Abit Ahmad Oktarian, Ardika Satria, Dhafin Razaqa Luthfi, Fadhil Fitra Wijaya, Luluk Muthoharoh, Martin Clinton Tosima Manullang.

**Figure 1.** Training and validation loss curves over epochs.

**Figure 2.** Confusion matrix of the BiLSTM+Attention model on the test set.

**Figure 3.** Attention heatmaps highlighting the model’s focus on sentiment-bearing words for positive (top) and negative

Original abstract

This paper investigates sentiment classification of Steam game reviews using an attention-based Bidirectional Long Short-Term Memory (BiLSTM) model. Using a dataset of 50,000 reviews sampled from a larger Steam review corpus, the authors compare a traditional machine learning baseline based on TF-IDF and PyCaret AutoML with a deep learning approach implemented in PyTorch. The proposed BiLSTM+Attention model is trained with class-weighted cross-entropy to address class imbalance and achieves 83% accuracy and 85% weighted F1-score on the test set, with 90% recall for negative reviews. The paper also presents attention visualizations to show interpretability by highlighting sentiment-bearing words. The study concludes that the BiLSTM+Attention model is effective for analyzing user sentiment in Steam reviews and useful for helping developers understand player feedback.
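For readers unfamiliar with the baseline's features, a textbook TF-IDF weighting can be sketched as below. This uses the classic `tf * log(N / df)` variant on whitespace tokens. It is an assumption for illustration and may differ from the weighting and preprocessing in the paper's PyCaret pipeline. The mini-corpus is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


# Hypothetical mini-corpus for illustration.
docs = ["great game great story", "bad game", "great soundtrack"]
vectors = tfidf(docs)
```

In the second document, "bad" receives a higher weight than "game" because it appears in fewer documents. Each word is scored without regard to order or context, which is the limitation a BiLSTM with attention is meant to address.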

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A straightforward BiLSTM+Attention model for Steam review sentiment that reports good metrics but lacks crucial details on data sampling and generalization.

This is basically a standard attention-based BiLSTM model applied to sentiment classification on a sample of Steam game reviews. It gets 83% accuracy and 85% weighted F1 after using class weights for imbalance, and includes some attention visualizations for interpretability.

What the paper does well is show a practical comparison to a TF-IDF baseline using PyCaret, and highlight how the model can point to specific words driving the sentiment. That could be useful for game developers monitoring player feedback.

The main issue is the lack of detail around the data. The 50,000 reviews are sampled from a larger corpus, but there's no info on the sampling method, class balance in the sample, or whether the split was random or time-aware. Without that, the test set performance might not generalize well to future reviews. No hyperparameter details or error analysis either, which leaves the results a bit hard to evaluate fully.

This kind of work is aimed at people in the gaming industry who need tools to process reviews quickly, not at advancing core NLP research. The approach is solid on its own terms but doesn't introduce anything new methodologically.

I'd bring this to a reading group focused on applied NLP or industry use cases. I wouldn't cite it in my own papers. It deserves a serious referee once the authors add the missing experimental details, as the core experiment seems honest and the practical angle has some value.

Referee Report

2 major / 2 minor

Summary. The paper claims that an attention-based BiLSTM model, trained with class-weighted cross-entropy loss on a sample of 50,000 Steam reviews, outperforms a TF-IDF + PyCaret AutoML baseline for sentiment classification. It reports 83% accuracy, 85% weighted F1-score, and 90% recall for negative reviews on a held-out test set, along with attention visualizations to highlight sentiment-bearing words and improve interpretability.

Significance. If the performance generalizes, the work shows that BiLSTM with attention can provide stronger and more interpretable results than AutoML baselines for game review sentiment analysis, which may help developers extract actionable player feedback. The use of conventional metrics on a held-out set and the inclusion of attention maps are strengths that support practical utility in the domain.

major comments (2)

[Dataset section] The sampling procedure for the 50,000 reviews (random, stratified, temporal, or otherwise) and the train/test split details are not described. This is load-bearing for the central claim because the 83% accuracy and 85% weighted F1 on the held-out portion cannot be taken as evidence of effectiveness or generalization without confirming that the sample represents the full Steam corpus and that the test distribution matches future reviews.
[Experimental Setup and Results sections] No information is given on hyperparameter tuning, preprocessing pipeline, or statistical testing of the improvement over the baseline. These omissions undermine the ability to reproduce the result or confirm that the BiLSTM+Attention model is reliably superior rather than benefiting from unstated implementation choices.
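The kind of split description the first major comment asks for (ratio, seed, stratification) can be made concrete with a short sketch. This is one reasonable procedure written with the standard library only. The 80/20 ratio, the seed, and the toy corpus are assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, test_ratio=0.2, seed=42):
    """Split items into train/test so each label keeps roughly the same share."""
    rng = random.Random(seed)  # fixed seed so the split can be reported and reproduced
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical imbalanced corpus: 80 positive (1) and 20 negative (0) reviews.
corpus = [(f"review {i}", 1) for i in range(80)]
corpus += [(f"review {i}", 0) for i in range(80, 100)]
train, test = stratified_split(corpus, label_of=lambda pair: pair[1])
```

Reporting the ratio, the seed, and the stratification variable alongside the results would let readers rebuild the same partition.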

minor comments (2)

[Abstract] The abstract states the model is 'effective' but does not quantify the baseline performance numbers for direct comparison.
[Results] An error analysis that goes beyond the confusion matrix in Figure 2 would clarify why negative recall reaches 90% while overall accuracy is 83%.
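To see how high negative recall and lower overall accuracy can coexist on imbalanced data, the metrics can be derived from a confusion matrix as sketched below. The counts are hypothetical and are not the paper's. They were chosen only so that the arithmetic resembles the reported pattern.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def summarize(matrix):
    """matrix[true][pred] holds counts for a 2-class problem (0 = negative, 1 = positive)."""
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[c][c] for c in range(2)) / total
    report, weighted_f1 = {}, 0.0
    for c in range(2):
        other = 1 - c
        tp, fn, fp = matrix[c][c], matrix[c][other], matrix[other][c]
        precision, recall, f1 = per_class_metrics(tp, fp, fn)
        support = tp + fn
        weighted_f1 += f1 * support / total  # weight each class F1 by its support
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, weighted_f1, report


# Hypothetical counts, NOT taken from the paper: 200 negative and 800 positive reviews.
matrix = [
    [180, 20],   # truly negative reviews: 180 caught, 20 missed
    [150, 650],  # truly positive reviews: 150 flagged as negative, 650 correct
]
accuracy, weighted_f1, report = summarize(matrix)
```

In this toy matrix most errors are positive reviews flagged as negative. That keeps negative recall high while lowering negative precision and overall accuracy, which is the trade-off a class-weighted loss tends to produce.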

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These points identify key areas where additional detail will strengthen the reproducibility and interpretability of our results. We address each major comment below and will revise the manuscript to incorporate the requested information.

Referee: [Dataset section] The sampling procedure for the 50,000 reviews (random, stratified, temporal, or otherwise) and the train/test split details are not described. This is load-bearing for the central claim because the 83% accuracy and 85% weighted F1 on the held-out portion cannot be taken as evidence of effectiveness or generalization without confirming that the sample represents the full Steam corpus and that the test distribution matches future reviews.

Authors: We agree that the current description of the dataset is insufficient for assessing representativeness and generalization. In the revised manuscript we will expand the Dataset section with a complete account of the sampling procedure used to obtain the 50,000 reviews from the larger Steam corpus and the precise train/test split (including ratio, randomization seed if any, and whether stratification by label or other variables was applied). This addition will allow readers to evaluate whether the held-out test distribution is appropriate for the claims made.

Revision: yes

Referee: [Experimental Setup and Results sections] No information is given on hyperparameter tuning, preprocessing pipeline, or statistical testing of the improvement over the baseline. These omissions undermine the ability to reproduce the result or confirm that the BiLSTM+Attention model is reliably superior rather than benefiting from unstated implementation choices.

Authors: We acknowledge that the Experimental Setup and Results sections lack these critical details. We will revise both sections to document the full preprocessing pipeline (tokenization, vocabulary construction, sequence padding, and any text cleaning steps), the hyperparameter tuning approach and the specific configurations explored, and the statistical tests performed to compare the BiLSTM+Attention model against the TF-IDF + PyCaret baseline. These changes will support reproducibility and allow readers to judge the reliability of the reported performance gains.

Revision: yes
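One test that fits this comparison is McNemar's exact test, sketched below, because both models are evaluated on the same test reviews. The choice of test is an assumption, since neither the referee nor the authors name one, and the toy predictions are hypothetical.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count test examples that exactly one of the two models classifies correctly."""
    only_a = sum(a == y != b for y, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(b == y != a for y, a, b in zip(y_true, pred_a, pred_b))
    return only_a, only_b


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value computed from the discordant counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    k = min(only_a_correct, only_b_correct)
    # Under H0 each discordant example favours either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical labels and predictions for two models on the same eight reviews.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 1, 0, 0, 1, 0, 1, 1]  # e.g. the neural model
pred_b = [0, 0, 1, 1, 0, 1, 1, 0]  # e.g. the baseline
only_a, only_b = discordant_counts(y_true, pred_a, pred_b)
p_value = mcnemar_exact(only_a, only_b)
```

Only the reviews on which the two models disagree in correctness enter the test, so it addresses whether one model is reliably better on the same data and not merely whether the two accuracy figures differ.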

Circularity Check

0 steps flagged

No circularity: empirical metrics on held-out test data with no self-referential derivations

The paper's core claims consist of measured performance numbers (83% accuracy, 85% weighted F1, 90% negative recall) obtained by training the BiLSTM+Attention model on a train split and evaluating on an explicit held-out test split of the 50k sampled reviews. No equations, derivations, or self-citations are present that reduce these metrics to quantities defined by the fitted parameters themselves or by prior work from the same authors. The class-weighted cross-entropy loss and attention mechanism are standard architectural choices whose outputs are evaluated externally against ground-truth labels rather than being tautological. The baseline comparison (TF-IDF + PyCaret) is likewise a direct empirical contrast. The derivation chain is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on standard supervised-learning assumptions plus the representativeness of the sampled dataset; no free parameters are explicitly fitted in the abstract beyond routine model weights and class weights.

free parameters (1)

class weights
Introduced to counter class imbalance; specific values not stated in abstract but required for the reported training procedure.

axioms (1)

Domain assumption: Dataset labels provide reliable ground truth for positive versus negative sentiment.
The evaluation metrics assume the provided review labels are accurate and consistent.
