100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Rustem Yeshpanov

arxiv: 2605.08600 · v2 · submitted 2026-05-09 · 💻 cs.CL

Pith reviewed 2026-05-13 07:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords movie reviews · sentiment analysis · Kazakh language · Russian language · code-switching · multilingual corpus · polarity classification · transformer models

The pith

A new corpus of over 100,000 movie reviews from Kazakhstan, written mostly in Russian with Kazakh and code-switched texts, supports benchmarking of multilingual sentiment models on polarity and score tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases a collection of 100,502 movie reviews gathered from a Kazakhstan site and covering nearly 5,000 films from 2001 to 2025. Most texts are in Russian, but the set also includes Kazakh and code-switched examples, each labeled by hand for language and three-way sentiment polarity, with a subset carrying explicit user ratings from one to five. Two tasks are defined: classifying polarity as positive, negative or neutral, and assigning the five-level score. Classical bag-of-words and TF-IDF methods are compared to multilingual transformers such as mBERT and XLM-RoBERTa. The transformers show higher accuracy on polarity detection, while the score task stays difficult because of uneven class sizes and small differences between adjacent ratings.
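The effect of uneven class sizes on the score task can be illustrated with a minimal sketch that compares accuracy with macro-averaged F1 for a predictor that always outputs the most frequent rating. The rating counts below are invented for illustration and are not taken from the paper; `macro_f1` and `accuracy` are hypothetical helper names.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily skewed distribution of 1-5 ratings (not the paper's).
gold = [5] * 70 + [4] * 15 + [3] * 8 + [2] * 4 + [1] * 3

# A trivial predictor that always outputs the most frequent rating.
majority = Counter(gold).most_common(1)[0][0]
pred = [majority] * len(gold)

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

On such skewed data the trivial predictor looks reasonable by accuracy but poor by macro-F1, which is one reason a class-balanced metric is a common choice for tasks like this.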

Core claim

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outper

What carries the argument

The new multilingual movie review corpus with manual language and polarity annotations, serving as the testbed for comparing classical text classifiers to transformer models on three-class polarity and five-class score tasks.
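As a rough illustration of what a classical TF-IDF polarity baseline involves, the sketch below builds TF-IDF vectors by hand and classifies with nearest-centroid cosine similarity. This summary does not say which classifier the paper places on top of its BoW/TF-IDF features, so nearest-centroid is a stand-in chosen for brevity. The toy reviews and all function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer; a real baseline would need something sturdier."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every training token."""
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def fit_centroids(docs, labels, idf):
    """Sum the TF-IDF vectors of each class (direction is all cosine needs)."""
    sums = defaultdict(lambda: defaultdict(float))
    for d, y in zip(docs, labels):
        for t, v in tfidf(d, idf).items():
            sums[y][t] += v
    return {y: dict(vec) for y, vec in sums.items()}


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(text, idf)

    def cosine(centroid):
        norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items()) / norm

    return max(centroids, key=lambda y: cosine(centroids[y]))


# Invented toy reviews mixing Russian and Kazakh words (not from the corpus).
train_docs = ["отличный фильм", "керемет кино", "ужасный фильм",
              "нашар кино", "обычный фильм"]
train_labels = ["positive", "positive", "negative", "negative", "neutral"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("керемет фильм", idf, centroids))
print(predict("нашар фильм", idf, centroids))
```

Both test inputs combine a Kazakh adjective with a Russian noun, a small stand-in for the code-switched reviews the corpus contains.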

Load-bearing premise

The manual annotations for language and sentiment are accurate and consistent across annotators, and the evaluation setup prevents data leakage between train and test sets.

What would settle it

A replication experiment that shows classical baselines match or beat transformer accuracy on the polarity task when using the same leakage-controlled train-test splits would falsify the claimed performance advantage.
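One way such a replication could compare two systems on identical test items is a paired bootstrap over the shared test set. The sketch below is a generic version of that procedure, not something the paper is reported to have run; `paired_bootstrap` and the toy labels are hypothetical.

```python
import random


def paired_bootstrap(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A is strictly more accurate than B.

    Both systems are scored on the same resampled test items, so the
    comparison stays paired.
    """
    if not gold or not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("gold, pred_a and pred_b must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(gold)
    a_hit = [int(g == p) for g, p in zip(gold, pred_a)]
    b_hit = [int(g == p) for g, p in zip(gold, pred_b)]
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a_hit[i] for i in idx) > sum(b_hit[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Invented labels: system A makes one error, system B makes three.
gold = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "pos"]
system_a = ["pos", "neg", "neu", "pos", "neg", "pos", "pos", "pos"]
system_b = ["pos", "pos", "neu", "pos", "neu", "pos", "pos", "pos"]

print(paired_bootstrap(gold, system_a, system_b))
```

A fraction close to 1.0 would suggest that A's advantage is stable under resampling. If a classical baseline reached a similar fraction against a transformer on the same leakage-controlled split, that would be the kind of evidence described above.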

Figures

Figures reproduced from arXiv: 2605.08600 by Rustem Yeshpanov.

**Figure 1.** Kazakh-language review share over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper releases a solid new movie review corpus from Kazakhstan but leaves the annotation process undocumented.

The main takeaway is a new public dataset of 100,502 movie reviews from Kazakhstan, mostly Russian but with Kazakh and code-switched examples, manually labeled for language and sentiment polarity, plus a subset with explicit ratings. The authors set up polarity and score classification tasks and show transformers beating bag-of-words baselines on the former while the latter stays hard due to imbalance and fine distinctions. Leakage control in the splits is a responsible touch.

What works here is the scale and the focus on an under-resourced language pair with real code-switching data; that fills a practical gap for multilingual sentiment work. The benchmarks are standard and the results line up with what one would expect.

The clear soft spot is the annotation details. The paper states the labels are manual but reports no inter-annotator agreement, no annotator count, no guidelines, and no adjudication process. In code-switched text, where polarity can be ambiguous, that omission makes it harder to trust how much the model gaps reflect signal versus label noise.

This is the sort of resource paper that low-resource NLP groups will actually use and cite once the data is out. It is not a deep theoretical advance, but the data itself is new and usable. I would bring it to a reading group on datasets or multilingual methods. It deserves peer review so the authors can add the missing annotation protocol without losing the core value of the release.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce a large new multilingual corpus of 100,502 movie reviews from Kazakhstan, including Russian, Kazakh, and code-switched content, manually annotated for language and sentiment polarity, with a subset of 11,309 reviews having user ratings. It defines two tasks—three-way polarity classification and five-class score classification—and benchmarks classical BoW/TF-IDF methods against multilingual transformer models like mBERT, XLM-RoBERTa, and RemBERT, concluding that transformers outperform on polarity while score classification is difficult due to imbalance and subtle rating distinctions under leakage-controlled evaluation.

Significance. This work provides a valuable public resource for sentiment analysis in low-resource and code-switched settings involving Kazakh and Russian, which are underrepresented in NLP datasets. The scale (over 100k reviews across 4,943 titles from 2001-2025) enables robust benchmarking and potential studies on language dynamics. The emphasis on leakage-controlled evaluation strengthens the empirical findings, and the release of the dataset could spur further research in multilingual and Central Asian NLP.

major comments (1)

[Data collection and annotation] The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.
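For readers unfamiliar with the agreement metric the referee names, here is a minimal sketch of Cohen's kappa for two annotators. The label sequences are invented and do not come from the paper; `cohen_kappa` is a hypothetical helper written from the standard definition, (observed agreement − chance agreement) / (1 − chance agreement).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented polarity labels for ten reviews from two hypothetical annotators.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "pos", "neg", "neu", "neu", "neg", "neu", "pos", "pos", "pos"]

print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.3f}")
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.37, and kappa comes out near 0.68. With more than two annotators, Fleiss' kappa would be the analogous statistic.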

minor comments (1)

[Experimental setup] The abstract mentions 'leakage-controlled evaluation' but the full manuscript should explicitly describe the method used to prevent leakage (e.g., splitting by title, user, or time) to allow full reproducibility.
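To make the referee's example concrete, the sketch below shows one possible leakage control: assigning every review of a given title to the same split through a deterministic hash of the title identifier. This summary does not state which method the paper actually uses, so this is an assumption. The field names `title_id` and `text`, the `salt` parameter, and the synthetic reviews are all hypothetical.

```python
import hashlib


def split_of(title_id, test_fraction=0.2, salt="v1"):
    """Map a title to 'train' or 'test' so all its reviews share one split."""
    digest = hashlib.sha256(f"{salt}:{title_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Synthetic stand-in data: 50 titles with 3 reviews each.
reviews = [
    {"title_id": f"title-{t}", "text": f"review {r} of title {t}"}
    for t in range(50)
    for r in range(3)
]

train = [r for r in reviews if split_of(r["title_id"]) == "train"]
test = [r for r in reviews if split_of(r["title_id"]) == "test"]

train_titles = {r["title_id"] for r in train}
test_titles = {r["title_id"] for r in test}
print(len(train), len(test), len(train_titles & test_titles))
```

Because the assignment depends only on the title identifier, no film contributes reviews to both splits, and rerunning the script reproduces the same partition. Splitting by user or by time would follow the same pattern with a different grouping key.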

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the description of the annotation process.

Referee: The manuscript states that reviews are 'manually annotated for language and sentiment polarity' (abstract) but provides no details on the annotation process, including the number of annotators, inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa), annotation guidelines, or adjudication procedures for disagreements. This is a load-bearing issue for the central benchmark claims, as the reported superiority of transformer models on polarity classification relies entirely on the accuracy and consistency of these labels, particularly given the challenges of code-switching where polarity and language identification can be ambiguous.

Authors: We agree that the manuscript currently lacks these essential details on the annotation process, which are important for assessing label quality and reliability, especially in code-switched settings. In the revised version, we will add a dedicated subsection in the dataset section describing the annotation methodology. This will include the number of annotators, the specific guidelines provided to them, inter-annotator agreement metrics (e.g., Cohen's or Fleiss' kappa), and the procedures used to adjudicate disagreements. These additions will directly address the concern and better support the validity of the polarity classification benchmarks.

revision: yes

Circularity Check

0 steps flagged

No circularity: standard data release and empirical benchmark

The paper releases a new corpus of 100k+ movie reviews, describes manual annotation for language and polarity, defines two classification tasks, and reports direct experimental comparisons between classical baselines and multilingual transformers. No equations, derivations, fitted parameters presented as predictions, or self-citations appear in the provided text. The central claims rest on empirical results on the released data rather than any reduction to inputs by construction. This matches the expected non-circular outcome for a dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities; the contribution is the dataset itself with standard annotation assumptions.

axioms (1)

domain assumption: Manual annotations for language and sentiment are reliable and consistent. The paper relies on this for the validity of the labeled data.

pith-pipeline@v0.9.0 · 5434 in / 1022 out tokens · 35258 ms · 2026-05-13T07:37:16.512995+00:00 · methodology

