Code Mixologist : A Practitioner's Guide to Building Code-Mixed LLMs
Pith reviewed 2026-05-16 11:47 UTC · model grok-4.3
The pith
A unifying taxonomy organizes code-mixing research in LLMs along data, modeling, and evaluation axes into a practical playbook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a taxonomy that classifies prior work on code-mixing and code-switching in large language models along three dimensions—data, modeling, and evaluation—and distills the reviewed findings into an actionable playbook of recommendations for constructing, adapting, and testing CSW-capable LLMs.
What carries the argument
The unifying taxonomy that sorts research by data practices, modeling techniques, and evaluation methods, from which a practical playbook of recommendations is derived.
If this is right
- Engineers can use the distilled recommendations to improve grammaticality and factuality of LLMs on mixed-language inputs.
- Evaluation protocols can be revised to reduce sources of instability and improve reproducibility across studies.
- Safety testing must include code-mixing prompts to detect potential bypasses of built-in safeguards.
- New benchmarks should expand language-pair coverage beyond English-centric collections.
- Future modeling work can target the open challenges identified after reviewing current pre-training, post-training, and prompting methods.
Where Pith is reading between the lines
- The same taxonomy structure could be tested on other multilingual behaviors such as dialect blending or informal register shifts.
- Real-world deployment in bilingual regions would provide a natural test of whether the playbook improves user-facing performance.
- Combining the recommendations with existing safety alignment methods could reduce the risk of code-mixing bypasses.
- Developers working on low-resource language pairs may need to generate synthetic mixed data following the taxonomy's data guidelines.
Load-bearing premise
The published studies on code-mixing in LLMs are representative enough that the three-axis taxonomy covers the main approaches without large omissions.
What would settle it
Publication of a substantial new code-mixing technique or benchmark that fits none of the taxonomy's data, modeling, or evaluation categories would show the framework is incomplete.
Original abstract
Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys code-mixing and code-switching (CSW) research for LLMs and introduces a unifying taxonomy organized along data, modeling, and evaluation dimensions. It distills this work into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. It reviews modeling approaches ranging from CSW-tailored pre-training and post-training to prompting and in-context learning, analyzes evaluation practices for instability and reproducibility issues, and catalogs benchmarks while examining their linguistic coverage and English-centric biases. It closes with safety concerns, including the use of code-mixing to bypass safeguards, and a list of open challenges.
Significance. If the taxonomy is grounded in representative coverage, this synthesis could serve as a useful practical reference for practitioners, consolidating scattered findings on modeling strategies, evaluation pitfalls, and safety risks into actionable guidance that might help standardize work on multilingual LLMs in mixed-language settings.
major comments (2)
- [Introduction / abstract] The abstract and introduction provide no description of the literature review protocol (search terms, databases, date range, inclusion/exclusion criteria, or handling of non-English sources). This is load-bearing for the central claim because the unifying taxonomy and derived playbook rest on the assumption that the cataloged work is representative across data, modeling, and evaluation; unspecified methodology leaves open the possibility of systematic omissions that would weaken the recommendations.
- [Evaluation practices] The evaluation section claims to highlight sources of instability and limited reproducibility but does not provide concrete quantitative comparisons (e.g., variance in scores across runs or specific benchmark pairs) to demonstrate that these issues are widespread enough to undermine current practices. Without such grounding, the critique of English-centric biases and the call for better benchmarks remain high-level.
minor comments (1)
- [Title] The title contains an extraneous space before the colon (Code Mixologist : A ...); standard formatting omits this space.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our survey. We address each major comment below and will revise the manuscript to improve transparency and grounding.
- Referee: [Introduction / abstract] The abstract and introduction provide no description of the literature review protocol (search terms, databases, date range, inclusion/exclusion criteria, or handling of non-English sources). This is load-bearing for the central claim because the unifying taxonomy and derived playbook rest on the assumption that the cataloged work is representative across data, modeling, and evaluation; unspecified methodology leaves open the possibility of systematic omissions that would weaken the recommendations.
  Authors: We acknowledge the value of explicit methodology for transparency. As a practitioner's guide rather than a formal systematic review, our literature selection prioritized recent, empirically grounded works on LLM code-mixing (primarily post-2020 to capture the shift to instruction-tuned models). To address this, we will add a dedicated subsection in the introduction describing key search terms (e.g., 'code-mixing LLM', 'code-switching pre-training'), sources (arXiv, ACL Anthology, major conferences), date range, and inclusion criteria focused on relevance to data, modeling, or evaluation for LLMs, while noting coverage of non-English sources where available. This will clarify the scope supporting the taxonomy.
  Revision: yes
- Referee: [Evaluation practices] The evaluation section claims to highlight sources of instability and limited reproducibility but does not provide concrete quantitative comparisons (e.g., variance in scores across runs or specific benchmark pairs) to demonstrate that these issues are widespread enough to undermine current practices. Without such grounding, the critique of English-centric biases and the call for better benchmarks remain high-level.
  Authors: We agree that quantitative grounding would strengthen the section. The current manuscript discusses instability qualitatively (e.g., prompt sensitivity and cross-run variance in mixed-language settings). In revision, we will add specific examples from the literature, such as reported standard deviations in CSW benchmark scores (e.g., 4-8 point BLEU variance across seeds in certain pairs) and comparisons showing larger performance gaps in code-mixed vs. monolingual evaluations. This will better support the analysis of English-centric biases and the recommendations for improved benchmarks.
  Revision: yes
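The kind of quantitative grounding requested in the second exchange can be sketched in a few lines. The following is a minimal illustration, not the authors' protocol: the per-seed scores are invented placeholders, and the names `summarize`, `monolingual_scores`, and `code_mixed_scores` are hypothetical. It reports the mean and sample standard deviation across seeds for each condition, plus the gap between monolingual and code-mixed means.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


# Hypothetical per-seed scores, invented purely for illustration.
# They are NOT taken from the paper or from any benchmark.
monolingual_scores = [31.0, 32.0, 33.0]
code_mixed_scores = [20.0, 24.0, 28.0]

mono_mean, mono_std = summarize(monolingual_scores)
cm_mean, cm_std = summarize(code_mixed_scores)

# Gap between conditions: positive means code-mixed input scores lower.
gap = mono_mean - cm_mean

print(f"monolingual: {mono_mean:.1f} +/- {mono_std:.1f}")
print(f"code-mixed:  {cm_mean:.1f} +/- {cm_std:.1f}")
print(f"gap:         {gap:.1f}")
```

Reporting the spread next to the mean lets a reader judge whether a gap between conditions is larger than the seed-to-seed noise, which is the comparison the referee asks for.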
Circularity Check
No circularity: survey synthesizes external literature without self-referential derivations
This is a review paper that introduces a taxonomy by organizing prior external work along data, modeling, and evaluation dimensions and distills recommendations from that review. No equations, fitted parameters, predictions from the paper's own data, or load-bearing self-citations appear in the provided text. The central claims rest on cataloging and critiquing existing benchmarks and approaches rather than reducing to the paper's own inputs by construction. The absence of a search protocol is a methodological limitation but does not create circularity.