arxiv: 2605.08600 · v2 · submitted 2026-05-09 · 💻 cs.CL

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Pith reviewed 2026-05-13 07:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords movie reviews · sentiment analysis · Kazakh language · Russian language · code-switching · multilingual corpus · polarity classification · transformer models

The pith

A new corpus of over 100,000 movie reviews from Kazakhstan supports benchmarking of multilingual sentiment models on polarity and score tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases a collection of 100,502 movie reviews gathered from kino.kz, a Kazakhstan site, covering 4,943 films from 2001 to 2025. Most texts are in Russian, but the set also includes Kazakh and code-switched examples. Each review is labeled by hand for language and three-way sentiment polarity, and a subset of 11,309 carries explicit user ratings from one to five. Two tasks are defined: classifying polarity as positive, negative or neutral, and assigning the five-level score. Classical bag-of-words and TF-IDF methods are compared to the multilingual transformers mBERT, XLM-RoBERTa and RemBERT. The transformers consistently outperform the classical baselines on polarity classification, while the score task stays difficult under leakage-controlled evaluation because of severe class imbalance and small differences between adjacent ratings.

Core claim

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.

What carries the argument

The new multilingual movie review corpus with manual language and polarity annotations, serving as the testbed for comparing classical text classifiers to transformer models on three-class polarity and five-class score tasks.
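
The classical side of this comparison rests on TF-IDF weighting. As a rough illustration only, and not the paper's actual pipeline, the sketch below computes TF-IDF vectors in pure Python using the smoothed IDF form ln((1 + n) / (1 + df)) + 1. The whitespace tokenizer, the function name `tfidf`, and the three toy reviews are assumptions introduced here for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

# Hypothetical toy reviews (not taken from the corpus).
docs = ["фильм отличный", "фильм керемет", "фильм скучный"]
vectors = tfidf(docs)
```

A term that occurs in every document ("фильм") receives the minimum weight, while a term unique to one review is weighted higher. Real baselines would normally add normalization and a classifier on top of these vectors.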

Load-bearing premise

The manual annotations for language and sentiment are accurate and consistent across annotators, and the evaluation setup prevents data leakage between train and test sets.

What would settle it

A replication experiment that shows classical baselines match or beat transformer accuracy on the polarity task when using the same leakage-controlled train-test splits would falsify the claimed performance advantage.

Figures

Figures reproduced from arXiv: 2605.08600 by Rustem Yeshpanov.

Figure 1: Kazakh-language review share over time
… Kazakh lexical root with Russian negation and imperative morphology; and еркеки (“men”), formed using a Kazakh lexical root combined with a Russian plural inflection. Such phenomena make the data collected particularly valuable for studying regional language variation, code-switching, and culturally grounded named entity usage in real-world user-generated text. 3.…
Original abstract

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
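
The abstract attributes the difficulty of score classification partly to severe class imbalance. A small sketch can show why imbalance makes plain accuracy misleading: a model that always predicts the majority rating scores well on accuracy but poorly on macro-averaged F1. The rating distribution below is hypothetical, and this summary does not state which metrics the paper reports, so the choice of macro-F1 here is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical, heavily skewed 1-5 ratings (not the paper's distribution).
y_true = [5] * 80 + [4] * 10 + [3] * 5 + [2] * 3 + [1] * 2
y_pred = [5] * 100  # degenerate model: always predicts the majority rating

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Under these made-up numbers the majority-class predictor reaches 0.8 accuracy while its macro-F1 stays below 0.2, because four of the five classes contribute an F1 of zero.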

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce a large new multilingual corpus of 100,502 movie reviews from Kazakhstan, including Russian, Kazakh, and code-switched content, manually annotated for language and sentiment polarity, with a subset of 11,309 reviews having user ratings. It defines two tasks—three-way polarity classification and five-class score classification—and benchmarks classical BoW/TF-IDF methods against multilingual transformer models like mBERT, XLM-RoBERTa, and RemBERT, concluding that transformers outperform on polarity while score classification is difficult due to imbalance and subtle rating distinctions under leakage-controlled evaluation.

Significance. This work provides a valuable public resource for sentiment analysis in low-resource and code-switched settings involving Kazakh and Russian, which are underrepresented in NLP datasets. The scale (over 100k reviews across 4,943 titles from 2001-2025) enables robust benchmarking and potential studies on language dynamics. The emphasis on leakage-controlled evaluation strengthens the empirical findings, and the release of the dataset could spur further research in multilingual and Central Asian NLP.

major comments (1)
  1. [Data collection and annotation] The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.
minor comments (1)
  1. [Experimental setup] The abstract mentions 'leakage-controlled evaluation' but the full manuscript should explicitly describe the method used to prevent leakage (e.g., splitting by title, user, or time) to allow full reproducibility.
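
One of the leakage controls the minor comment mentions, splitting by title, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's documented procedure: the function name `split_by_group`, the record fields (`title_id`, `text`, `polarity`), and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group (e.g. a film title) spans train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    keys = sorted(groups)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(keys)
    # The fraction applies to the number of groups, not the number of records.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test

# Hypothetical records; field names are illustrative only.
reviews = [
    {"title_id": t, "text": f"review {i} of film {t}", "polarity": "pos"}
    for t in range(10)
    for i in range(3)
]
train, test = split_by_group(reviews, "title_id")
```

Because whole titles are assigned to one side, reviews of the same film cannot appear in both sets. The same function could group by user or by year if those fields were available.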

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the description of the annotation process.

Point-by-point responses
  1. Referee: The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.

    Authors: We agree that the manuscript currently lacks these essential details on the annotation process, which are important for assessing label quality and reliability, especially in code-switched settings. In the revised version, we will add a dedicated subsection in the dataset section describing the annotation methodology. This will include the number of annotators, the specific guidelines provided to them, inter-annotator agreement metrics (e.g., Cohen's or Fleiss' kappa), and the procedures used to adjudicate disagreements. These additions will directly address the concern and better support the validity of the polarity classification benchmarks.
    revision: yes
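
For readers unfamiliar with the agreement metric named in this exchange, Cohen's kappa for two annotators can be computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal pure-Python version; the label lists are hypothetical and do not come from the paper, which reports no agreement figures in the text summarized here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical polarity labels from two annotators (not from the paper).
ann_1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos"]
ann_2 = ["pos", "neu", "neg", "neu", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(ann_1, ann_2)
```

With these toy labels the annotators agree on 6 of 8 items (p_o = 0.75), chance agreement is 0.375, and kappa is 0.6. Fleiss' kappa would be the analogous choice when more than two annotators label each item.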

Circularity Check

0 steps flagged

No circularity: standard data release and empirical benchmark

Full rationale

The paper releases a new corpus of 100k+ movie reviews, describes manual annotation for language and polarity, defines two classification tasks, and reports direct experimental comparisons between classical baselines and multilingual transformers. No equations, derivations, fitted parameters presented as predictions, or self-citations appear in the provided text. The central claims rest on empirical results on the released data rather than any reduction to inputs by construction. This matches the expected non-circular outcome for a dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities; the contribution is the dataset itself with standard annotation assumptions.

axioms (1)
  • domain assumption: Manual annotations for language and sentiment are reliable and consistent.
    The paper relies on this for the validity of the labeled data.

pith-pipeline@v0.9.0 · 5434 in / 1022 out tokens · 35258 ms · 2026-05-13T07:37:16.512995+00:00 · methodology

