arxiv: 2510.07037 · v6 · submitted 2025-10-08 · 💻 cs.CL

Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

Pith reviewed 2026-05-18 09:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords code-switching · large language models · multilingual NLP · survey · mixed-language inputs · NLP tasks · datasets · evaluation biases

The pith

A survey of 327 studies shows large language models still struggle with code-switched language inputs across modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines how large language models handle code-switched language, where speakers mix languages within sentences or conversations. It reviews 327 studies across five research areas, more than 15 NLP tasks, more than 30 datasets, and more than 80 languages to categorize advances in model architecture, training strategies, and evaluation methods. The work outlines how LLMs have reshaped code-switching research while highlighting persistent problems, such as limited datasets and evaluation biases, that hinder real-world use. This matters because code-switching occurs daily in multilingual communities, and closing these gaps could make AI systems more practical and equitable for speakers who do not use a single language exclusively.

Core claim

The paper provides the first comprehensive analysis of CSW-aware LLM research by reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. It categorizes recent advances by architecture, training strategy, and evaluation methodology, outlines how LLMs have reshaped CSW modeling, identifies the challenges that persist, and concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities.

What carries the argument

Categorization of advances by architecture, training strategy, and evaluation methodology, together with a proposed roadmap for inclusive datasets and fair evaluation.

Load-bearing premise

The literature search and selection process captured a representative sample of all relevant code-switched LLM work without major omissions or language biases.

What would settle it

Discovery of a substantial body of relevant pre-2024 studies on code-switched LLMs that were omitted from the reviewed set of 327 papers would undermine the claim of comprehensive coverage.

Figures

Figures reproduced from arXiv: 2510.07037 by Himanshu Beniwal, Mahavir Patil, Mayank Singh, Rajvee Sheth, Samridhi Raj Sinha.

Figure 1: Common model failures on code-mixed text.
Figure 2: A taxonomy of the code-switching research landscape.
Figure 3: Failure cases when we prompt ChatGPT in Odia-Romanized Hindi code-mixed pair.
Figure 4: Failure cases when we prompt GLM-4.6 in Bangla-English code-mixed pair.
Figure 5: Failure cases when we prompt Perplexity in Konkani-English code-mixed pair.
Figure 6: Analysis of Code-Switching Dataset Sizes.
Figure 7: Top 20 Language Pairs in Code-Switching Research.
Figure 8: Distribution of NLP Tasks and Categories in Code-Switching Research.
Figure 9: Language pair distribution across 202 code-mixing related datasets and benchmarks papers.

Original abstract

Amidst the rapid advances of large language models (LLMs), most LLMs still struggle with mixed-language inputs, limited Codeswitching (CSW) datasets, and evaluation biases, which hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. We categorize recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and identifying the challenges that persist. The paper concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities. https://github.com/lingo-iitgn/awesome-code-mixing/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This survey paper provides the first comprehensive analysis of code-switched (CSW) NLP research in the era of large language models (LLMs) across modalities. It reviews 327 studies from five research areas, covering 15+ NLP tasks, 30+ datasets, and 80+ languages. The paper categorizes recent advances by architecture, training strategy, and evaluation methodology, identifies ongoing challenges, and offers a roadmap for future work including the need for inclusive datasets, fair evaluation, and linguistically grounded models.

Significance. Assuming the reviewed studies form a representative sample, this work would be significant for the field by synthesizing the current state of CSW-aware LLM research, which is crucial for multilingual societies. It highlights how LLMs have reshaped CSW modeling and points to persistent issues like limited datasets and evaluation biases. The accompanying GitHub repository adds practical value by curating resources.

major comments (1)
  1. [Literature Search and Selection (likely in Methods or §2)] The manuscript states a review of 327 studies but provides only high-level descriptions of the search strategy without exact keywords, databases used, time window (e.g., post-2022), inclusion/exclusion criteria, or a PRISMA flow diagram. This undermines the ability to verify the claim of comprehensive coverage across 80+ languages and recent preprints, raising concerns about potential omissions or biases.
minor comments (2)
  1. [Abstract] The title mentions 'across Modalities' but the abstract and categorization seem primarily focused on text-based CSW; consider clarifying the extent of multimodal coverage in the survey.
  2. [Overall] Ensure consistent use of abbreviations like CSW and LLM throughout the paper.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey. We appreciate the opportunity to clarify our methodology and strengthen the manuscript accordingly. Below we provide a point-by-point response to the major comment.

Point-by-point responses
  1. Referee: [Literature Search and Selection (likely in Methods or §2)] The manuscript states a review of 327 studies but provides only high-level descriptions of the search strategy without exact keywords, databases used, time window (e.g., post-2022), inclusion/exclusion criteria, or a PRISMA flow diagram. This undermines the ability to verify the claim of comprehensive coverage across 80+ languages and recent preprints, raising concerns about potential omissions or biases.

    Authors: We acknowledge that the current description of the literature search process is high-level and agree that greater transparency is warranted for a survey of this scope. In the revised version, we will expand §2 (or add a dedicated 'Literature Review Methodology' subsection) to explicitly document: (1) the full set of search keywords and Boolean queries (e.g., ('code-switching' OR 'code-mixing' OR 'codeswitch') AND ('LLM' OR 'large language model' OR 'transformer' OR 'multilingual'), combined with modality-specific terms); (2) the databases and sources queried (arXiv, ACL Anthology, Google Scholar, EMNLP/ACL/NAACL proceedings, and selected preprint servers); (3) the primary time window (January 2022 to mid-2024, to focus on the LLM era, with selective inclusion of foundational pre-2022 works); and (4) clear inclusion/exclusion criteria (e.g., studies must address at least one NLP task involving code-switched data with LLMs or closely related models; non-NLP and monolingual-only works are excluded). We will also add a PRISMA-style flow diagram in the appendix showing the progression from initial retrievals to the final 327 studies. These additions will directly address concerns about verifiability, coverage of 80+ languages, and inclusion of recent preprints while preserving the survey's narrative flow.

    revision: yes
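
The Boolean query described in the response above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual search tooling: the two keyword lists are copied from the rebuttal, the code-switching terms are treated as one OR group, and `MODALITY_TERMS`, `or_group`, `build_query`, and `matches` are hypothetical names introduced here (the rebuttal mentions modality-specific terms but does not list any).

```python
# Term groups quoted in the simulated rebuttal.
CSW_TERMS = ["code-switching", "code-mixing", "codeswitch"]
MODEL_TERMS = ["LLM", "large language model", "transformer", "multilingual"]
# Hypothetical placeholder: the rebuttal does not name its modality terms.
MODALITY_TERMS = ["speech", "vision"]


def or_group(terms):
    """Join terms into a parenthesized, quoted OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """AND together one OR clause per non-empty term group."""
    return " AND ".join(or_group(g) for g in groups if g)


def matches(text, *groups):
    """True if text contains at least one term from every group (case-insensitive)."""
    lowered = text.lower()
    return all(any(t.lower() in lowered for t in g) for g in groups)


query = build_query(CSW_TERMS, MODEL_TERMS)
print(query)
```

Keeping each concept as its own list makes the inclusion logic auditable: a title or abstract passes the `matches` filter only if it hits every group, which mirrors the AND-of-ORs structure of the query string. Real database front ends differ in query syntax, so the printed string would likely need adapting per source.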

Circularity Check

0 steps flagged

No circularity: survey aggregates external studies without self-referential derivations or fitted predictions

Full rationale

This is a literature survey paper whose central claims consist of categorizing and summarizing 327 external studies across tasks, datasets, and languages. No equations, quantitative predictions, or derivations appear in the provided abstract or structure. The 'first comprehensive' framing rests on an external literature search whose replicability is a methodological concern but does not constitute a circular reduction of any result to the paper's own inputs by construction. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a literature survey the central claim rests primarily on the completeness of the paper collection and the validity of the chosen categorization scheme; no new free parameters, mathematical axioms, or invented entities are introduced.

axioms (1)
  • domain assumption: The 327 selected studies constitute a sufficiently complete and unbiased sample of CSW-aware LLM research.
    Invoked in the abstract when claiming the survey is the first comprehensive analysis.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs

    cs.CL · 2026-01 · unverdicted · novelty 5.0

    A survey that unifies prior code-switching research for LLMs into a taxonomy of data, modeling, and evaluation and distills it into actionable recommendations for practitioners.
