pith. machine review for the scientific record.

arxiv: 2604.18722 · v1 · submitted 2026-04-20 · 💻 cs.CL

Recognition: unknown

Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords: transliteration · cross-lingual transfer · script barrier · code-mixed text · language models · NLP survey · transfer learning

The pith

Transliteration converts writing systems to raise lexical overlap and ease cross-lingual transfer in NLP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The survey maps how transliteration addresses the script barrier that blocks knowledge sharing between languages in natural language processing. It builds a taxonomy of reasons to apply transliteration and surveys the main ways to feed transliterated text into models. The review traces how these methods developed, weighs their reported gains against trade-offs such as accuracy versus speed, and places them in the context of current large language models. Concrete gains show up when handling mixed-language text, when languages share a family, and when inference needs to run faster. The paper closes with direct guidance on choosing a transliteration method according to the languages, task, and available resources.

Core claim

Transliteration converts text from one script to another so that models see greater lexical overlap across languages and therefore transfer knowledge more effectively. Different ways of adding transliteration at input time have appeared over time, each carrying its own accuracy, efficiency, and coverage trade-offs that vary with the target languages and tasks.
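
A minimal sketch of the mechanism behind this claim, assuming the third-party unidecode package as a stand-in transliterator (a real pipeline would use a language-aware scheme such as ISO 15919); the example sentences and the n-gram overlap metric are illustrative, not from the paper:

```python
# Minimal sketch: how moving text into a shared script can raise surface
# lexical overlap. `unidecode` is a crude stand-in for a real transliterator.
from unidecode import unidecode

def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams as a rough proxy for shared lexical units."""
    text = text.replace(" ", "_")
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

hindi = "मेरा नाम राज है"        # Devanagari script
latin = "mera nam raj hai"       # the same sentence in Latin script

before = jaccard(char_ngrams(hindi), char_ngrams(latin))
after = jaccard(char_ngrams(unidecode(hindi)), char_ngrams(latin))

print(f"overlap before transliteration: {before:.2f}")  # 0.00: disjoint scripts
print(f"overlap after transliteration:  {after:.2f}")   # > 0 once scripts match
```

Models that share subword vocabularies across languages can only exploit overlap that is visible at the surface, which is exactly what the script conversion buys.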

What carries the argument

A taxonomy of motivations for applying transliteration, paired with the set of input-incorporation approaches; together these organize the observed benefits across code-mixing, language-family relatedness, and inference efficiency.
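
To make that machinery concrete, here is a toy rendering of the two taxonomies as a lookup table; the motivation names are taken from this review's summary, while the goal/approach pairings are illustrative rather than the survey's literal structure:

```python
# Toy rendering of the survey's two axes: motivations for transliteration
# mapped to candidate input-incorporation approaches. Pairings are
# illustrative; the paper's actual taxonomy may differ.
TAXONOMY: dict[str, dict[str, str]] = {
    "code_mixing": {
        "goal": "align mixed scripts into one representation",
        "approach": "normalize all input to a single script before tokenization",
    },
    "family_relatedness": {
        "goal": "raise lexical overlap between related languages",
        "approach": "romanize both languages to expose shared subwords",
    },
    "inference_efficiency": {
        "goal": "shorten token sequences for non-Latin scripts",
        "approach": "feed romanized text to a Latin-heavy tokenizer",
    },
}

def recommend(setting: str) -> str:
    """Toy selection rule in the spirit of the survey's recommendations."""
    entry = TAXONOMY.get(setting)
    return entry["approach"] if entry else "keep native script"

print(recommend("family_relatedness"))
```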

If this is right

  • For code-mixed text, transliteration improves model handling by aligning mixed scripts into one representation.
  • Languages from the same family gain more from transfer when transliteration increases shared vocabulary.
  • Inference-time speed improves when transliteration into a tokenizer-friendly script shortens input token sequences in large models (see the sketch after this list).
  • Researchers obtain concrete selection rules that match transliteration strategy to language, task, and compute limits.
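
The efficiency bullet is directly checkable. A hedged sketch, assuming the Hugging Face transformers library and mBERT's tokenizer; the model choice and sentence are illustrative, not drawn from the survey:

```python
# Sketch of the efficiency check: does romanized input tokenize into fewer
# subwords under a vocabulary that is richer in Latin script? Assumes
# `transformers` and `unidecode` are installed; model and sentence are
# illustrative choices, not the survey's setup.
from transformers import AutoTokenizer
from unidecode import unidecode

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

native = "यह एक लंबा हिंदी वाक्य है जो कई टोकन में विभाजित होगा"
romanized = unidecode(native)

native_ids = tokenizer.encode(native, add_special_tokens=False)
roman_ids = tokenizer.encode(romanized, add_special_tokens=False)

# Fewer tokens means fewer forward-pass positions (and fewer decode steps in
# generation); whether romanization wins depends on how well the tokenizer's
# vocabulary covers the native script.
print(f"native script: {len(native_ids)} tokens")
print(f"romanized:     {len(roman_ids)} tokens")
```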

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trade-off analysis could be reapplied to test whether transliteration still helps once models reach the scale of the newest LLMs released after the survey.
  • Low-resource languages not heavily represented in the reviewed literature could be checked to see if the same taxonomy still predicts useful strategies.
  • Combining transliteration with other cross-lingual signals such as shared subword units might produce larger gains than either method alone.

Load-bearing premise

The taxonomy and the collected studies together cover the main ways transliteration is used today without leaving out important recent work or major language settings.

What would settle it

A controlled experiment on a language pair and task outside the surveyed cases that shows transliteration either reduces accuracy or adds no measurable gain would undermine the selection recommendations.
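
A sketch of how such a settling run could be scored, assuming per-example correctness from two runs of the same model on the same held-out test set, with and without transliterated input; the score arrays below are synthetic placeholders, not results:

```python
# Paired bootstrap over per-example correctness: with vs. without
# transliteration on the same test items. Scores here are synthetic.
import random

def paired_bootstrap(with_translit, without, n_resamples=10_000, seed=0):
    """Fraction of resamples in which transliteration beats the baseline."""
    assert len(with_translit) == len(without)
    rng = random.Random(seed)
    n = len(with_translit)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        wins += sum(with_translit[i] for i in idx) > sum(without[i] for i in idx)
    return wins / n_resamples

rng = random.Random(42)
baseline = [rng.random() < 0.60 for _ in range(200)]   # True = correct
translit = [rng.random() < 0.60 for _ in range(200)]   # null case: no real gain

p_win = paired_bootstrap(translit, baseline)
print(f"P(transliteration > baseline) = {p_win:.3f}")
# Near 0.5 on an unsurveyed language pair: no measurable gain, cutting against
# the selection recommendations; near 1.0: support for them.
```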

Figures

Figures reproduced from arXiv: 2604.18722 by Deepon Halder, Raj Dabre, Thanmay Jayakumar.

Figure 1: Illustration of common transliteration am
Figure 2: A taxonomy of the key motivations as to why transliterated data may be useful.
Figure 3: A taxonomy of the key approaches as to how transliterated data may be integrated.
Original abstract

Cross-lingual transfer in NLP is often hindered by the "script barrier" where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations to utilize transliterations in language models, and provide an overview of different approaches of incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discussing the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings that show how transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper surveys the role of transliteration in NLP for overcoming the script barrier in cross-lingual transfer. It introduces a taxonomy of motivations for using transliteration in language models, overviews methods for incorporating transliterations as input, analyzes the evolution, effectiveness, and trade-offs of these approaches, and situates them within modern LLMs. The review highlights benefits in code-mixed text handling, leveraging language family relatedness, and inference efficiency gains, and concludes with concrete recommendations for selecting transliteration strategies based on language, task, and resource constraints.

Significance. If the taxonomy and literature synthesis prove comprehensive, the survey offers a useful organizing framework for researchers in multilingual NLP and LLMs by consolidating motivations, approaches, and practical trade-offs. It explicitly credits prior work through structured analysis and provides actionable recommendations that could aid strategy selection in resource-constrained settings, particularly where efficiency matters. The emphasis on pragmatic gains in modern LLMs is a timely contribution if recent developments are adequately covered.

major comments (1)
  1. [Taxonomy and modern LLMs contextualization] Taxonomy section and the modern LLMs contextualization (as referenced in the abstract): The central claim that transliteration yields benefits in code-mixing, language-family transfer, and inference efficiency, leading to concrete strategy recommendations, depends on the taxonomy comprehensively capturing current practice. If post-2022 literature on transliteration interactions with subword/byte-level tokenizers (e.g., in mT5, Llama, or Mistral-style models) or long-context efficiency is omitted, the generalization to current LLM settings is undermined. This is load-bearing for the recommendations in resource-constrained scenarios.
minor comments (2)
  1. [Abstract] The abstract states the survey contextualizes transliteration in modern LLMs but does not indicate the cutoff date for reviewed literature or list key recent models explicitly; adding this would help readers evaluate coverage.
  2. [Taxonomy] The figure or table presenting the taxonomy of motivations would benefit from a clearer visual distinction between historical and contemporary approaches to aid quick reference.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key area for strengthening the manuscript. We address the major comment below and will incorporate revisions to enhance the coverage of recent literature.

Point-by-point responses
  1. Referee: Taxonomy section and the modern LLMs contextualization (as referenced in the abstract): The central claim that transliteration yields benefits in code-mixing, language-family transfer, and inference efficiency, leading to concrete strategy recommendations, depends on the taxonomy comprehensively capturing current practice. If post-2022 literature on transliteration interactions with subword/byte-level tokenizers (e.g., in mT5, Llama, or Mistral-style models) or long-context efficiency is omitted, the generalization to current LLM settings is undermined. This is load-bearing for the recommendations in resource-constrained scenarios.

    Authors: We agree that robust coverage of post-2022 developments is necessary to support the recommendations for modern LLM settings. The taxonomy organizes motivations (e.g., lexical overlap for code-mixing, family relatedness, and efficiency) that are largely architecture-agnostic, and the manuscript already reviews the evolution of incorporation methods through transformer-based models while contextualizing needs in LLMs. However, to directly address the concern, we will expand the modern LLMs section with additional analysis and citations of recent works examining transliteration interactions with subword tokenizers (as in Llama and Mistral) and byte-level approaches, including any documented effects on long-context efficiency. This revision will make the generalization and practical recommendations more explicit and evidence-based without altering the core taxonomy. revision: yes

Circularity Check

0 steps flagged

No circularity: survey aggregates external studies

Full rationale

This is a literature survey that presents a taxonomy of motivations, overviews approaches from cited works, analyzes trade-offs, and offers recommendations based on external evidence. No equations, fitted parameters, or self-derived predictions exist. Central claims about benefits in code-mixing, language-family transfer, and efficiency rest on reviewed literature rather than reducing to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. No free parameters, axioms, or invented entities are introduced; the content rests on synthesis of previously published NLP studies.

pith-pipeline@v0.9.0 · 5467 in / 1108 out tokens · 48832 ms · 2026-05-10T05:04:21.455853+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 52 canonical work pages

  1. [1]

    Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages

    Zhuang, Wenhao and Sun, Yuan and Zhao, Xiaobing. Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.795

  2. [2]

    NICT's Participation in WAT 2018: Approaches Using Multilingualism and Recurrently Stacked Layers

    Dabre, Raj and Kunchukuttan, Anoop and Fujita, Atsushi and Sumita, Eiichiro. NICT's Participation in WAT 2018: Approaches Using Multilingualism and Recurrently Stacked Layers. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation. 2018

  3. [3]

    Robust neural machine translation with joint textual and phonetic embedding

    Liu, Hairong and Ma, Mingbo and Huang, Liang and Xiong, Hao and He, Zhongjun. Robust Neural Machine Translation with Joint Textual and Phonetic Embedding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1291

  4. [4]

    Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages

    Nakov, Preslav and Ng, Hwee Tou. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009

  5. [5]

    RomanLens: The Role of Latent Romanization in Multilinguality in LLMs

    Saji, Alan and Husain, Jaavid Aktar and Jayakumar, Thanmay and Dabre, Raj and Kunchukuttan, Anoop and Puduppully, Ratish. RomanLens: The Role of Latent Romanization in Multilinguality in LLMs. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1354

  6. [6]

    Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings

    Wen-Yi, Andrea W and Mimno, David. Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.71

  7. [7]

    How Multilingual is Multilingual BERT?

    Pires, Telmo and Schlinger, Eva and Garrette, Dan. How Multilingual is Multilingual BERT?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1493

  8. [8]

    When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

    Muller, Benjamin and Anastasopoulos, Antonios and Sagot, Benoît and Seddah, Djamé. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10...

  9. [9]

    Pushing the Limits of Low-Resource Morphological Inflection

    Anastasopoulos, Antonios and Neubig, Graham. Pushing the Limits of Low-Resource Morphological Inflection. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1091

  10. [10]

    A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation

    van der Wees, Marlies and Bisazza, Arianna and Monz, Christof. A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation. Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). 2016

  11. [11]

    Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration

    Lin, Ying and Pan, Xiaoman and Deri, Aliya and Ji, Heng and Knight, Kevin. Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration. Proceedings of the Sixth Named Entity Workshop. 2016. doi:10.18653/v1/W16-2701

  12. [12]

    Integrating an Unsupervised Transliteration Model into Statistical Machine Translation

    Durrani, Nadir and Sajjad, Hassan and Hoang, Hieu and Koehn, Philipp. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers. 2014. doi:10.3115/v1/E14-4029

  13. [13]

    Improving machine translation via triangulation and transliteration

    Durrani, Nadir and Koehn, Philipp. Improving machine translation via triangulation and transliteration. Proceedings of the 17th Annual Conference of the European Association for Machine Translation. 2014

  14. [14]

    Romanization-based Approach to Morphological Analysis in Korean SMS Text Processing

    Kim, Youngsam and Shin, Hyopil. Romanization-based Approach to Morphological Analysis in Korean SMS Text Processing. Proceedings of the Sixth International Joint Conference on Natural Language Processing. 2013

  15. [15]

    HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity

    He, Junqing and Wu, Long and Zhao, Xuemin and Yan, Yonghong. HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017. doi:10.18653/v1/S17-2033

  16. [16]

    Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment

    Semmar, Nasredine and Saadane, Houda. Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment. Proceedings of the Sixth International Joint Conference on Natural Language Processing. 2013

  17. [17]

    How do you pronounce your name? Improving G2P with transliterations

    Bhargava, Aditya and Kondrak, Grzegorz. How do you pronounce your name? Improving G2P with transliterations. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011

  18. [18]

    Robust Deep Learning Based Sentiment Classification of Code-Mixed Text

    Mukherjee, Siddhartha and Prasan, Vinuthkumar and Nediyanchath, Anish and Shah, Manan and Kumar, Nikhil. Robust Deep Learning Based Sentiment Classification of Code-Mixed Text. Proceedings of the 16th International Conference on Natural Language Processing. 2019

  19. [19]

    Integration of an Arabic Transliteration Module into a Statistical Machine Translation System

    Kashani, Mehdi M. and Joanis, Eric and Kuhn, Roland and Foster, George and Popowich, Fred. Integration of an Arabic Transliteration Module into a Statistical Machine Translation System. Proceedings of the Second Workshop on Statistical Machine Translation. 2007

  20. [20]

    Cross-lingual Named Entity List Search via Transliteration

    Khakhmovich, Aleksandr and Pavlova, Svetlana and Kirillova, Kira and Arefyev, Nikolay and Savilova, Ekaterina. Cross-lingual Named Entity List Search via Transliteration. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

  21. [21]

    Arabizi sentiment analysis based on transliteration and automatic corpus annotation

    Guellil, Imane and Adeel, Ahsan and Azouaou, Faical and Benali, Fodil and Hachani, Ala-eddine and Hussain, Amir. Arabizi sentiment analysis based on transliteration and automatic corpus annotation. Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2018. doi:10.18653/v1/W18-6249

  22. [22]

    Towards Offensive Language Identification for Dravidian Languages

    Sai, Siva and Sharma, Yashvardhan. Towards Offensive Language Identification for Dravidian Languages. Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 2021

  23. [23]

    DE-ABUSE@TamilNLP-ACL 2022: Transliteration as Data Augmentation for Abuse Detection in Tamil

    Palanikumar, Vasanth and Benhur, Sean and Hande, Adeep and Chakravarthi, Bharathi Raja. DE-ABUSE@TamilNLP-ACL 2022: Transliteration as Data Augmentation for Abuse Detection in Tamil. Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages. 2022. doi:10.18653/v1/2022.dravidianlangtech-1.5

  24. [24]

    Hate Speech and Offensive Language Detection in Bengali

    Das, Mithun and Banerjee, Somnath and Saha, Punyajoy and Mukherjee, Animesh. Hate Speech and Offensive Language Detection in Bengali. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2022. doi:1...

  25. [25]

    English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation

    Laskar, Sahinur Rahman and Dadure, Pankaj and Manna, Riyanka and Pakray, Partha and Bandyopadhyay, Sivaji. English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation. Proceedings of the 9th Workshop on Asian Translation. 2022

  26. [26]

    DLRG-DravidianLangTech@EACL 2024: Combating Hate Speech in Telugu Code-mixed Text on Social Media

    Rajalakshmi, Ratnavel and M, Saptharishee and S, Hareesh and R, Gabriel and Sr, Varsini. DLRG-DravidianLangTech@EACL 2024: Combating Hate Speech in Telugu Code-mixed Text on Social Media. Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. 2024

  27. [27]

    Sandalphon@DravidianLangTech-EACL 2024: Hate and Offensive Language Detection in Telugu Code-mixed Text using Transliteration-Augmentation

    Tabassum, Nafisa and Khan, Mosabbir and Ahsan, Shawly and Hossain, Jawad and Hoque, Mohammed Moshiul. Sandalphon@DravidianLangTech-EACL 2024: Hate and Offensive Language Detection in Telugu Code-mixed Text using Transliteration-Augmentation. Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. 2024

  28. [28]

    Romanization-based Large-scale Adaptation of Multilingual Language Models

    Purkayastha, Sukannya and Ruder, Sebastian and Pfeiffer, Jonas and Gurevych, Iryna and Vulić, Ivan. Romanization-based Large-scale Adaptation of Multilingual Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.538

  29. [29]

    RoMantra: Optimizing Neural Machine Translation for Low-Resource Languages through Romanization

    Soni, Govind and Bhattacharyya, Pushpak. RoMantra: Optimizing Neural Machine Translation for Low-Resource Languages through Romanization. Proceedings of the 21st International Conference on Natural Language Processing (ICON). 2024

  30. [30]

    RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

    Husain, Jaavid and Dabre, Raj and M, Aswanth and Gala, Jay and Jayakumar, Thanmay and Puduppully, Ratish and Kunchukuttan, Anoop. RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. do...

  31. [31]

    Jailbreaking LLMs with Arabic Transliteration and Arabizi

    Al Ghanim, Mansour and Almohaimeed, Saleh and Zheng, Mengxin and Solihin, Yan and Lou, Qian. Jailbreaking LLMs with Arabic Transliteration and Arabizi. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1034

  32. [32]

    Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

    Ma, Chunlan and Liu, Yihong and Ye, Haotian and Schuetze, Hinrich. Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts. Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025). 2025. doi:10.18653/v1/2025.mrl-main.27

  33. [33]

    On Romanization for Model Transfer Between Scripts in Neural Machine Translation

    Amrhein, Chantal and Sennrich, Rico. On Romanization for Model Transfer Between Scripts in Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.223

  34. [34]

    Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

    Nag, Arijit and Mukherjee, Animesh and Ganguly, Niloy and Chakrabarti, Soumen. Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.920

  35. [35]
  36. [36]

    Unsupervised Machine Translation On Dravidian Languages

    Koneru, Sai and Liu, Danni and Niehues, Jan. Unsupervised Machine Translation On Dravidian Languages. Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 2021

  37. [37]

    Unsupervised Cross-lingual Representation Learning at Scale

    Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzmán, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...

  38. [38]

    Pinyin as subword unit for Chinese-sourced neural machine translation

    Pinyin as subword unit for Chinese-sourced neural machine translation. 2017

  39. [39]

    XMU Neural Machine Translation Systems for WAT 2018 Myanmar-English Translation Task

    Wang, Boli and Hu, Jinming and Chen, Yidong and Shi, Xiaodong. XMU Neural Machine Translation Systems for WAT 2018 Myanmar-English Translation Task. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation. 2018

  40. [40]

    Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation

    Aqlan, Fares and Fan, Xiaoping and Alqwbani, Abdullah and Al-Mansoub, Akram. Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation

  41. [41]

    A universal parent model for low-resource neural machine translation transfer

    A universal parent model for low-resource neural machine translation transfer. arXiv preprint arXiv:1909.06516

  42. [42]

    Name Translation based on Fine-grained Named Entity Recognition in a Single Language

    Sadamitsu, Kugatsu and Saito, Itsumi and Katayama, Taichi and Asano, Hisako and Matsuo, Yoshihiro. Name Translation based on Fine-grained Named Entity Recognition in a Single Language. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016

  43. [43]

    Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

    Doddapaneni, Sumanth and Aralikatte, Rahul and Ramesh, Gowtham and Goyal, Shreya and Khapra, Mitesh M. and Kunchukuttan, Anoop and Kumar, Pratyush. Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volu...

  44. [44]

    IndicBART: A Pre-trained Model for Indic Natural Language Generation

    Dabre, Raj and Shrotriya, Himani and Kunchukuttan, Anoop and Puduppully, Ratish and Khapra, Mitesh and Kumar, Pratyush. IndicBART: A Pre-trained Model for Indic Natural Language Generation. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.145

  45. [45]

    Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

    Khemchandani, Yash and Mehtani, Sarvesh and Patil, Vaidehi and Awasthi, Abhijeet and Talukdar, Partha and Sarawagi, Sunita. Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conf...

  46. [46]

    Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages

    Goyal, Vikrant and Kumar, Sourav and Sharma, Dipti Misra. Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2020. doi:10.18653/v1/2020.acl-srw.22

  47. [47]

    Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

    Dhamecha, Tejas and Murthy, Rudra and Bharadwaj, Samarth and Sankaranarayanan, Karthik and Bhattacharyya, Pushpak. Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021....

  48. [48]

    Does Transliteration Help Multilingual Language Modeling?

    Moosa, Ibraheem Muhammad and Akhter, Mahmud Elahi and Habib, Ashfia Binte. Does Transliteration Help Multilingual Language Modeling?. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.50

  49. [49]

    The University of Maryland's Kazakh-English Neural Machine Translation System at WMT19

    Briakou, Eleftheria and Carpuat, Marine. The University of Maryland's Kazakh-English Neural Machine Translation System at WMT19. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019. doi:10.18653/v1/W19-5308

  50. [50]

    Pre-training via Leveraging Assisting Languages for Neural Machine Translation

    Song, Haiyue and Dabre, Raj and Mao, Zhuoyuan and Cheng, Fei and Kurohashi, Sadao and Sumita, Eiichiro. Pre-training via Leveraging Assisting Languages for Neural Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2020. doi:10.18653/v1/2020.acl-srw.37

  51. [51]

    Transliteration for Cross-Lingual Morphological Inflection

    Murikinati, Nikitha and Anastasopoulos, Antonios and Neubig, Graham. Transliteration for Cross-Lingual Morphological Inflection. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. 2020. doi:10.18653/v1/2020.sigmorphon-1.22

  52. [52]

    Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT 2021

    Khatri, Jyotsana and Saini, Nikhil and Bhattacharyya, Pushpak. Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT 2021. Proceedings of the 8th Workshop on Asian Translation (WAT2021). 2021. doi:10.18653/v1/2021.wat-1.26

  53. [53]

    A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

    Vania, Clara and Kementchedjhieva, Yova and Søgaard, Anders and Lopez, Adam. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

  54. [54]

    Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect

    Micallef, Kurt and Eryani, Fadhl and Habash, Nizar and Bouamor, Houda and Borg, Claudia. Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect. Proceedings of the Workshop on Computation and Written Language (CAWL 2023). 2023. doi:10.18653/v1/2023.cawl-1.4

  55. [55]

    MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness

    Zhou, Shijia and Shan, Huangyan and Plank, Barbara and Litschko, Robert. MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness. Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024). 2024. doi:10.18653/v1/2024.semeval-1.259

  56. [56]

    Cross-lingual Transfer Learning for Japanese Named Entity Recognition

    Johnson, Andrew and Karanasou, Penny and Gaspers, Judith and Klakow, Dietrich. Cross-lingual Transfer Learning for Japanese Named Entity Recognition. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers). 2019. doi:10.18653/v1/N19-2023

  57. [57]

    Rijhwani, Shruti and Xie, Jiateng and Neubig, Graham and Carbonell, Jaime. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. 2019. doi:10.1609/...

  58. [58]

    A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding

    Roychoudhury, Rajarshi and Dey, Subhrajit and Akhtar, Md and Das, Amitava and Naskar, Sudip. A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding. Proceedings of the 19th International Conference on Natural Language Processing (ICON). 2022

  59. [59]

    Multiple Character Embeddings for Chinese Word Segmentation

    Zhou, Jianing and Wang, Jingkang and Liu, Gongshen. Multiple Character Embeddings for Chinese Word Segmentation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2019. doi:10.18653/v1/P19-2029

  60. [60]

    Putting Figures on Influences on Moroccan Darija from Arabic, French and Spanish using the WordNet

    Mrini, Khalil and Bond, Francis. Putting Figures on Influences on Moroccan Darija from Arabic, French and Spanish using the WordNet. Proceedings of the 9th Global Wordnet Conference. 2018

  61. [61]

    Specializing Multilingual Language Models: An Empirical Study

    Chau, Ethan C. and Smith, Noah A. Specializing Multilingual Language Models: An Empirical Study. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.5

  62. [62]

    Alternative Input Signals Ease Transfer in Multilingual Machine Translation

    Sun, Simeng and Fan, Angela and Cross, James and Chaudhary, Vishrav and Tran, Chau and Koehn, Philipp and Guzmán, Francisco. Alternative Input Signals Ease Transfer in Multilingual Machine Translation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.363

  63. [63]

    TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

    Liu, Yihong and Ma, Chunlan and Ye, Haotian and Schuetze, Hinrich. TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.136

  64. [64]

    How transliterations improve crosslingual alignment

    How transliterations improve crosslingual alignment. arXiv preprint arXiv:2409.17326

  65. [65]

    Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

    Xhelili, Orgest and Liu, Yihong and Schuetze, Hinrich. Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.659

  66. [66]

    The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish

    Salehi, Ali and Jacobs, Cassandra L. The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish. Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology. 2024. doi:10.18653/v1/2024.sigmorphon-1.6

  67. [67]

    ScriptMix: Mixing Scripts for Low-resource Language Parsing

    Lee, Jaeseong and Lee, Dohyeon and Hwang, Seung-won. ScriptMix: Mixing Scripts for Low-resource Language Parsing. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.357

  68. [68]

    TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

    Liu, Yihong and Ma, Chunlan and Ye, Haotian and Schuetze, Hinrich. TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  69. [69]

    Input Combination Strategies for Multi-Source Transformer Decoder

    Libovický, Jindřich and Helcl, Jindřich and Mareček, David. Input Combination Strategies for Multi-Source Transformer Decoder. Proceedings of the Third Conference on Machine Translation: Research Papers. 2018. doi:10.18653/v1/W18-6326

  70. [70]

    Emerging Cross-lingual Structure in Pretrained Language Models

    Conneau, Alexis and Wu, Shijie and Li, Haoran and Zettlemoyer, Luke and Stoyanov, Veselin. Emerging Cross-lingual Structure in Pretrained Language Models. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.536

  71. [71]

    Lost in Transliteration: Bridging the Script Gap in Neural IR

    Lost in Transliteration: Bridging the Script Gap in Neural IR. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

  72. [72]

    Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

    Johnson, Melvin and Schuster, Mike and Le, Quoc V. and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Viégas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Co...

  73. [73]

    Transfer Learning for Low-Resource Neural Machine Translation

    Zoph, Barret and Yuret, Deniz and May, Jonathan and Knight, Kevin. Transfer Learning for Low-Resource Neural Machine Translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1163

  74. [74]

    Cross-lingual Language Model Pretraining

    Conneau, Alexis and Lample, Guillaume. Cross-lingual Language Model Pretraining

  75. [75]

    Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

    Jung, Haeji and Kim, Jinju and Kim, Kyungjin and Roh, Youjeong and Mortensen, David R. Happiness is Sharing a Vocabulary: A Study of Transliteration Methods. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.365

  76. [76]

    Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review

    Philippy, Fred and Guo, Siwen and Haddadan, Shohreh. Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.323

  77. [77]

    The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

    Winata, Genta and Aji, Alham Fikri and Yong, Zheng Xin and Solorio, Thamar. The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.185

  78. [78]

    Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

    Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities. 2026