arxiv: 2602.11181 · v2 · submitted 2026-01-21 · 💻 cs.CL

Code Mixologist : A Practitioner's Guide to Building Code-Mixed LLMs

Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney

Pith reviewed 2026-05-16 11:47 UTC · model grok-4.3

keywords: code-mixing, code-switching, large language models, multilingual modeling, taxonomy, evaluation benchmarks, model safety

The pith

A unifying taxonomy organizes code-mixing research in LLMs along data, modeling, and evaluation axes into a practical playbook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reviews how large language models perform when users blend multiple languages within the same sentence or conversation, a pattern called code-mixing or code-switching. It collects scattered studies and groups them under a single taxonomy that tracks choices about training data, model adaptation methods, and testing procedures. The authors then convert the grouped findings into a set of concrete recommendations that engineers can follow when building or improving LLMs for mixed-language settings. The overview also flags instability in current benchmarks and notes that code-mixing can be used to weaken model safety filters.

Core claim

The paper presents a taxonomy that classifies prior work on code-mixing and code-switching in large language models along three dimensions—data, modeling, and evaluation—and distills the reviewed findings into an actionable playbook of recommendations for constructing, adapting, and testing CSW-capable LLMs.

What carries the argument

The unifying taxonomy that sorts research by data practices, modeling techniques, and evaluation methods, from which a practical playbook of recommendations is derived.

If this is right

Engineers can use the distilled recommendations to improve grammaticality and factuality of LLMs on mixed-language inputs.
Evaluation protocols can be revised to reduce sources of instability and improve reproducibility across studies.
Safety testing must include code-mixing prompts to detect potential bypasses of built-in safeguards.
New benchmarks should expand language-pair coverage beyond English-centric collections.
Future modeling work can target the open challenges identified after reviewing current pre-training, post-training, and prompting methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same taxonomy structure could be tested on other multilingual behaviors such as dialect blending or informal register shifts.
Real-world deployment in bilingual regions would provide a natural test of whether the playbook improves user-facing performance.
Combining the recommendations with existing safety alignment methods could reduce the risk of code-mixing bypasses.
Developers working on low-resource language pairs may need to generate synthetic mixed data following the taxonomy's data guidelines.

Load-bearing premise

The published studies on code-mixing in LLMs are representative enough that the three-axis taxonomy covers the main approaches without large omissions.

What would settle it

Publication of a substantial new code-mixing technique or benchmark that fits none of the taxonomy's data, modeling, or evaluation categories would show the framework is incomplete.
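
The falsification test above can be sketched as a small bookkeeping check. This is a minimal illustration, not the paper's method: the `Axis`, `Work`, `unplaced`, and `coverage` names are hypothetical, and the three example entries are placeholders rather than real publications.

```python
from dataclasses import dataclass, field
from enum import Enum


class Axis(Enum):
    """The three taxonomy axes named in the paper."""
    DATA = "data"
    MODELING = "modeling"
    EVALUATION = "evaluation"


@dataclass
class Work:
    """A surveyed work and the axes it was assigned to (hypothetical record)."""
    title: str
    axes: set = field(default_factory=set)


def unplaced(works):
    """Titles of works that fit none of the axes; any hit is a coverage gap."""
    return [w.title for w in works if not w.axes]


def coverage(works):
    """Number of works assigned to each axis."""
    return {axis: sum(axis in w.axes for w in works) for axis in Axis}


# Placeholder entries for illustration only.
works = [
    Work("hypothetical synthetic-data study", {Axis.DATA}),
    Work("hypothetical prompting and benchmark study", {Axis.MODELING, Axis.EVALUATION}),
    Work("hypothetical unclassifiable technique"),
]

print(unplaced(works))
print(coverage(works))
```

A non-empty result from `unplaced` is the signal described above: a work that the three axes cannot absorb.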

Figures

Figures reproduced from arXiv: 2602.11181 by Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao.

Original abstract

Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A practical synthesis of code-mixing work that flags real evaluation problems but leaves its own literature coverage unverified.

This paper collects existing research on code-mixing in LLMs, sorts it into a taxonomy covering data, modeling, and evaluation, and turns the summary into a set of concrete recommendations for practitioners. That organization is the main contribution. It does a reasonable job walking through modeling choices such as tailored pre-training, post-training, and prompting, and it correctly notes that current benchmarks suffer from instability and English-centric coverage. The safety section on code-mixing as a way to bypass filters is also worth having in one place. The central claims rest on prior literature rather than new experiments, so there is little circularity or invented data. The soft spot is the missing review protocol. The abstract and summary give no search terms, databases, date range, or inclusion rules, which makes it impossible to check whether the taxonomy omits important papers on newer modeling tricks or non-English benchmarks. That gap is real but not fatal for a synthesis piece; it just means readers will want to cross-check coverage themselves. The paper is aimed at engineers and researchers who need a quick map before building or evaluating mixed-language systems. It is not a breakthrough but it is coherent and points to open problems. I would send it to peer review so the coverage can be checked and the playbook can be stress-tested by people who actually use these models.

Referee Report

2 major / 1 minor

Summary. The paper provides a comprehensive overview of code-mixing and code-switching (CSW) research for LLMs, introducing a unifying taxonomy organized along data, modeling, and evaluation dimensions. It distills these into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs, reviews modeling approaches from CSW-tailored pre-training and post-training to prompting and in-context learning, analyzes evaluation practices for instability and reproducibility issues, catalogs benchmarks while examining linguistic coverage and English-centric biases, and discusses safety concerns including code-mixing for bypassing safeguards along with open challenges.

Significance. If the taxonomy is grounded in representative coverage, this synthesis could serve as a useful practical reference for practitioners, consolidating scattered findings on modeling strategies, evaluation pitfalls, and safety risks into actionable guidance that might help standardize work on multilingual LLMs in mixed-language settings.

major comments (2)

[Introduction / abstract] The abstract and introduction provide no description of the literature review protocol (search terms, databases, date range, inclusion/exclusion criteria, or handling of non-English sources). This is load-bearing for the central claim because the unifying taxonomy and derived playbook rest on the assumption that the cataloged work is representative across data, modeling, and evaluation; unspecified methodology leaves open the possibility of systematic omissions that would weaken the recommendations.
[Evaluation practices] The evaluation section claims to highlight sources of instability and limited reproducibility but does not provide concrete quantitative comparisons (e.g., variance in scores across runs or specific benchmark pairs) to demonstrate that these issues are widespread enough to undermine current practices. Without such grounding, the critique of English-centric biases and the call for better benchmarks remain high-level.

minor comments (1)

[Title] The title contains an extraneous space before the colon (Code Mixologist : A ...); standard formatting omits this space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our survey. We address each major comment below and will revise the manuscript to improve transparency and grounding.

Referee: [Introduction / abstract] The abstract and introduction provide no description of the literature review protocol (search terms, databases, date range, inclusion/exclusion criteria, or handling of non-English sources). This is load-bearing for the central claim because the unifying taxonomy and derived playbook rest on the assumption that the cataloged work is representative across data, modeling, and evaluation; unspecified methodology leaves open the possibility of systematic omissions that would weaken the recommendations.

Authors: We acknowledge the value of explicit methodology for transparency. As a practitioner's guide rather than a formal systematic review, our literature selection prioritized recent, empirically grounded works on LLM code-mixing (primarily post-2020 to capture the shift to instruction-tuned models). To address this, we will add a dedicated subsection in the introduction describing key search terms (e.g., 'code-mixing LLM', 'code-switching pre-training'), sources (arXiv, ACL Anthology, major conferences), date range, and inclusion criteria focused on relevance to data, modeling, or evaluation for LLMs, while noting coverage of non-English sources where available. This will clarify the scope supporting the taxonomy.
Revision: yes

Referee: [Evaluation practices] The evaluation section claims to highlight sources of instability and limited reproducibility but does not provide concrete quantitative comparisons (e.g., variance in scores across runs or specific benchmark pairs) to demonstrate that these issues are widespread enough to undermine current practices. Without such grounding, the critique of English-centric biases and the call for better benchmarks remain high-level.

Authors: We agree that quantitative grounding would strengthen the section. The current manuscript discusses instability qualitatively (e.g., prompt sensitivity and cross-run variance in mixed-language settings). In revision, we will add specific examples from the literature, such as reported standard deviations in CSW benchmark scores (e.g., 4-8 point BLEU variance across seeds in certain pairs) and comparisons showing larger performance gaps in code-mixed vs. monolingual evaluations. This will better support the analysis of English-centric biases and the recommendations for improved benchmarks.
Revision: yes
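
The kind of cross-seed reporting requested in this exchange could look like the sketch below. It is a minimal illustration under stated assumptions: the `scores` dictionary and the `summarize` helper are hypothetical, and every number is a placeholder, not a result from the paper or the literature it reviews.

```python
from statistics import mean, stdev

# Hypothetical per-seed benchmark scores (placeholders, not reported results).
scores = {
    "monolingual": [41.0, 42.0, 40.0, 41.0],
    "code_mixed": [30.0, 36.0, 27.0, 35.0],
}


def summarize(runs):
    """Return (mean, sample standard deviation, max-min range) over seeds."""
    return mean(runs), stdev(runs), max(runs) - min(runs)


summary = {name: summarize(runs) for name, runs in scores.items()}
for name, (mu, sd, spread) in summary.items():
    print(f"{name}: mean={mu:.2f} sd={sd:.2f} range={spread:.2f}")

# Gap between settings; it should be read next to the per-setting spread,
# since a gap smaller than the seed-to-seed spread is weak evidence.
gap = summary["monolingual"][0] - summary["code_mixed"][0]
print(f"monolingual minus code-mixed: {gap:.2f}")
```

Reporting the spread next to the mean lets a reader judge whether a difference between two systems exceeds seed-to-seed noise, which is the grounding the referee asks for.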

Circularity Check

0 steps flagged

No circularity: survey synthesizes external literature without self-referential derivations

This is a review paper that introduces a taxonomy by organizing prior external work along data/modeling/evaluation dimensions and distills recommendations from that review. No equations, fitted parameters, predictions from own data, or load-bearing self-citations appear in the provided text. The central claims rest on cataloging and critiquing existing benchmarks and approaches rather than reducing to the paper's own inputs by construction. Absence of search protocol is a methodological limitation but does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper the contribution is organizational synthesis of existing studies rather than introduction of new free parameters, axioms, or invented entities.
