100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
Pith reviewed 2026-05-13 07:37 UTC · model grok-4.3
The pith
A new corpus of over 100,000 movie reviews from Kazakhstan, mostly in Russian with Kazakh and code-switched texts, supports benchmarking of multilingual sentiment models on polarity and score tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification.
What carries the argument
The new multilingual movie review corpus with manual language and polarity annotations, serving as the testbed for comparing classical text classifiers to transformer models on three-class polarity and five-class score tasks.
Load-bearing premise
The manual annotations for language and sentiment are accurate and consistent across annotators, and the evaluation setup prevents data leakage between train and test sets.
What would settle it
A replication experiment that shows classical baselines match or beat transformer accuracy on the polarity task when using the same leakage-controlled train-test splits would falsify the claimed performance advantage.
Original abstract
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
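The class-imbalance point in the abstract can be made concrete with a small sketch. The abstract does not say which metric the authors report, so macro-averaged F1 is an assumption made here for illustration, and the toy label distribution below is invented rather than taken from the corpus. It shows how a classifier that always predicts the most common score can look strong on accuracy while failing on every minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy distribution over five score classes (1-5), heavily skewed to 5.
y_true = [5] * 80 + [4] * 8 + [3] * 5 + [2] * 3 + [1] * 4
# A degenerate classifier that always predicts the majority class.
y_pred = [5] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
majority_macro_f1 = macro_f1(y_true, y_pred, labels=[1, 2, 3, 4, 5])
print(f"accuracy={majority_accuracy:.3f} macro_f1={majority_macro_f1:.3f}")
```

On this invented data the two numbers diverge sharply, which is why a score-classification benchmark under heavy imbalance is hard to read from accuracy alone.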
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a large multilingual corpus of 100,502 movie reviews from Kazakhstan, including Russian, Kazakh, and code-switched content, manually annotated for language and sentiment polarity, with a subset of 11,309 reviews carrying user ratings. It defines two tasks (three-way polarity classification and five-class score classification) and benchmarks classical BoW/TF-IDF methods against the multilingual transformer models mBERT, XLM-RoBERTa, and RemBERT. It concludes that transformers outperform the classical baselines on polarity, while score classification remains difficult under leakage-controlled evaluation because of class imbalance and subtle distinctions between adjacent ratings.
Significance. This work provides a valuable public resource for sentiment analysis in low-resource and code-switched settings involving Kazakh and Russian, which are underrepresented in NLP datasets. The scale (over 100k reviews across 4,943 titles from 2001-2025) enables robust benchmarking and potential studies on language dynamics. The emphasis on leakage-controlled evaluation strengthens the empirical findings, and the release of the dataset could spur further research in multilingual and Central Asian NLP.
major comments (1)
- [Data collection and annotation] The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.
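For readers unfamiliar with the agreement metrics the referee asks for, the following is a minimal sketch of Cohen's kappa for two annotators. The annotator label lists are invented for illustration; the manuscript as summarized here reports no annotator counts or agreement figures, and nothing below reflects the authors' actual procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented polarity labels from two hypothetical annotators.
annotator_a = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos"]
annotator_b = ["pos", "pos", "neg", "neg", "pos", "neg", "neu", "neu", "neg", "pos"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.4f}")
```

Cohen's kappa covers exactly two annotators; with three or more, Fleiss' kappa (also named by the referee) is the usual choice.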
minor comments (1)
- [Experimental setup] The abstract mentions 'leakage-controlled evaluation' but the full manuscript should explicitly describe the method used to prevent leakage (e.g., splitting by title, user, or time) to allow full reproducibility.
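One of the options the referee lists, splitting by title, can be sketched as follows. This is an assumption about what a leakage-controlled split might look like, not the authors' documented method; the `title_id` field, the salt string, and the toy records are all hypothetical. The idea is that every review of a given film lands in the same partition, so a model cannot score well simply by memorizing title-specific vocabulary.

```python
import hashlib


def assign_split(title_id, test_fraction=0.2, salt="example-split-v1"):
    """Deterministically map a title to 'train' or 'test' by hashing its id."""
    digest = hashlib.sha256(f"{salt}:{title_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_by_title(reviews, test_fraction=0.2):
    """Partition reviews so that no title appears on both sides of the split."""
    train, test = [], []
    for review in reviews:
        side = assign_split(review["title_id"], test_fraction)
        (test if side == "test" else train).append(review)
    return train, test


# Hypothetical records: 500 reviews spread over 50 titles.
reviews = [{"title_id": i % 50, "text": f"review {i}"} for i in range(500)]
train, test = split_by_title(reviews)

train_titles = {r["title_id"] for r in train}
test_titles = {r["title_id"] for r in test}
print(len(train), len(test), train_titles.isdisjoint(test_titles))
```

Note that `test_fraction` here applies to titles, not reviews, so the share of reviews in the test set depends on how reviews are distributed across titles. Splitting by user or by time would follow the same pattern with a different grouping key.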
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the description of the annotation process.
Point-by-point responses
Referee: The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.
Authors: We agree that the manuscript currently lacks these essential details on the annotation process, which are important for assessing label quality and reliability, especially in code-switched settings. In the revised version, we will add a dedicated subsection in the dataset section describing the annotation methodology. This will include the number of annotators, the specific guidelines provided to them, inter-annotator agreement metrics (e.g., Cohen's or Fleiss' kappa), and the procedures used to adjudicate disagreements. These additions will directly address the concern and better support the validity of the polarity classification benchmarks.
Revision: yes
Circularity Check
No circularity: standard data release and empirical benchmark
The paper releases a new corpus of 100k+ movie reviews, describes manual annotation for language and polarity, defines two classification tasks, and reports direct experimental comparisons between classical baselines and multilingual transformers. No equations, derivations, fitted parameters presented as predictions, or self-citations appear in the provided text. The central claims rest on empirical results on the released data rather than any reduction to inputs by construction. This matches the expected non-circular outcome for a dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Manual annotations for language and sentiment are reliable and consistent.