Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities
Pith reviewed 2026-05-18 09:25 UTC · model grok-4.3
The pith
A survey of 327 studies shows large language models still struggle with code-switched language inputs across modalities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper provides the first comprehensive analysis of CSW-aware LLM research by reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. It categorizes recent advances by architecture, training strategy, and evaluation methodology, outlines how LLMs have reshaped CSW modeling, identifies the challenges that persist, and concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities.
What carries the argument
Categorization of advances by architecture, training strategy, and evaluation methodology, together with a proposed roadmap for inclusive datasets and fair evaluation.
Load-bearing premise
The literature search and selection process captured a representative sample of all relevant code-switched LLM work without major omissions or language biases.
What would settle it
Discovery of a substantial body of relevant pre-2024 studies on code-switched LLMs that were omitted from the reviewed set of 327 papers would undermine the claim of comprehensive coverage.
Original abstract
Amidst the rapid advances of large language models (LLMs), most LLMs still struggle with mixed-language inputs, limited code-switching (CSW) datasets, and evaluation biases, which hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. We categorize recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and identifying the challenges that persist. The paper concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities. Repository: https://github.com/lingo-iitgn/awesome-code-mixing/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey paper provides the first comprehensive analysis of code-switched (CSW) NLP research in the era of large language models (LLMs) across modalities. It reviews 327 studies from five research areas, covering 15+ NLP tasks, 30+ datasets, and 80+ languages. The paper categorizes recent advances by architecture, training strategy, and evaluation methodology, identifies ongoing challenges, and offers a roadmap for future work including the need for inclusive datasets, fair evaluation, and linguistically grounded models.
Significance. Assuming the reviewed studies form a representative sample, this work is significant because it synthesizes the current state of CSW-aware LLM research, a topic that matters for deploying LLMs in multilingual societies. It highlights how LLMs have reshaped CSW modeling and points to persistent issues such as limited datasets and evaluation biases. The accompanying GitHub repository adds practical value by curating resources.
major comments (1)
- [Literature Search and Selection (likely in Methods or §2)] The manuscript states a review of 327 studies but provides only high-level descriptions of the search strategy without exact keywords, databases used, time window (e.g., post-2022), inclusion/exclusion criteria, or a PRISMA flow diagram. This undermines the ability to verify the claim of comprehensive coverage across 80+ languages and recent preprints, raising concerns about potential omissions or biases.
minor comments (2)
- [Abstract] The title mentions 'across Modalities' but the abstract and categorization seem primarily focused on text-based CSW; consider clarifying the extent of multimodal coverage in the survey.
- [Overall] Ensure consistent use of abbreviations like CSW and LLM throughout the paper.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our survey. We appreciate the opportunity to clarify our methodology and strengthen the manuscript accordingly. Below we provide a point-by-point response to the major comment.
Referee: [Literature Search and Selection (likely in Methods or §2)] The manuscript states a review of 327 studies but provides only high-level descriptions of the search strategy without exact keywords, databases used, time window (e.g., post-2022), inclusion/exclusion criteria, or a PRISMA flow diagram. This undermines the ability to verify the claim of comprehensive coverage across 80+ languages and recent preprints, raising concerns about potential omissions or biases.
Authors: We acknowledge that the current description of the literature search process is high-level and agree that greater transparency is warranted for a survey of this scope. In the revised version, we will expand §2 (or add a dedicated 'Literature Review Methodology' subsection) to explicitly document: (1) the full set of search keywords and Boolean queries (e.g., ('code-switching' OR 'code-mixing' OR 'codeswitch') AND ('LLM' OR 'large language model' OR 'transformer' OR 'multilingual'), combined with modality-specific terms); (2) the databases and sources queried (arXiv, ACL Anthology, Google Scholar, EMNLP/ACL/NAACL proceedings, and selected preprint servers); (3) the primary time window (January 2022–mid-2024 to focus on the LLM era, with selective inclusion of foundational pre-2022 works); and (4) clear inclusion/exclusion criteria (e.g., studies must address at least one NLP task involving code-switched data with LLMs or closely related models; non-NLP and monolingual-only works are excluded). We will also add a PRISMA-style flow diagram in the appendix showing the progression from initial retrievals to the final 327 studies. These additions will directly address concerns about verifiability, coverage of 80+ languages, and inclusion of recent preprints while preserving the survey's narrative flow.
revision: yes
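To make the promised search protocol concrete, the following is a minimal Python sketch of how such a Boolean query and a first screening pass might be expressed. It is an illustration under stated assumptions, not the authors' pipeline: the term lists are taken from the example in the response, while the `Record` type, the `screen` function, the substring matching, and the hard 2022 cutoff are all hypothetical. In particular, the sketch does not model the selective inclusion of foundational pre-2022 works or the modality-specific terms.

```python
from dataclasses import dataclass

# Term groups quoted from the authors' example; grouping them this way is an assumption.
CSW_TERMS = ["code-switching", "code-mixing", "codeswitch"]
MODEL_TERMS = ["LLM", "large language model", "transformer", "multilingual"]


def or_group(terms):
    """Join terms into a parenthesized, quoted OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """AND together several OR groups so operator precedence is explicit."""
    return " AND ".join(or_group(g) for g in groups)


@dataclass
class Record:  # hypothetical minimal metadata for one retrieved paper
    title: str
    year: int
    abstract: str


def mentions_any(text, terms):
    """Crude case-insensitive substring match; a real screen would be manual."""
    lowered = text.lower()
    return any(t.lower() in lowered for t in terms)


def screen(records, start_year=2022):
    """Apply illustrative inclusion criteria; return kept records and flow counts."""
    counts = {"retrieved": len(records), "out_of_window": 0, "off_topic": 0}
    kept = []
    for r in records:
        text = f"{r.title} {r.abstract}"
        if r.year < start_year:
            counts["out_of_window"] += 1
        elif not (mentions_any(text, CSW_TERMS) and mentions_any(text, MODEL_TERMS)):
            counts["off_topic"] += 1
        else:
            kept.append(r)
    counts["included"] = len(kept)
    return kept, counts
```

The `counts` dictionary mirrors the stages a PRISMA-style flow diagram would report (retrieved, excluded by reason, included), which is the kind of audit trail the referee asks for. Publishing the exact string returned by `build_query` alongside those counts would let a reader rerun the search and compare totals.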
Circularity Check
No circularity: survey aggregates external studies without self-referential derivations or fitted predictions
This is a literature survey paper whose central claims consist of categorizing and summarizing 327 external studies across tasks, datasets, and languages. No equations, quantitative predictions, or derivations appear in the provided abstract or structure. The 'first comprehensive' framing rests on an external literature search whose replicability is a methodological concern but does not constitute a circular reduction of any result to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 327 selected studies constitute a sufficiently complete and unbiased sample of CSW-aware LLM research.
Forward citations
Cited by 1 Pith paper
- Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs. A survey that unifies prior code-switching research for LLMs into a taxonomy of data, modeling, and evaluation and distills it into actionable recommendations for practitioners.