pith. machine review for the scientific record.

arxiv: 2604.18722 · v1 · submitted 2026-04-20 · 💻 cs.CL

Recognition: unknown

Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords: transliteration · cross-lingual transfer · script barrier · code-mixed text · language models · NLP survey · transfer learning

The pith

Transliteration converts writing systems to raise lexical overlap and ease cross-lingual transfer in NLP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The survey maps how transliteration addresses the script barrier that blocks knowledge sharing between languages in natural language processing. It builds a taxonomy of reasons to apply transliteration and surveys the main ways to feed transliterated text into models. The review traces how these methods developed, weighs their reported gains against trade-offs such as accuracy versus speed, and places them in the context of current large language models. Concrete gains show up when handling mixed-language text, when languages share a family, and when inference needs to run faster. The paper closes with direct guidance on choosing a transliteration method according to the languages, task, and available resources.

Core claim

Transliteration converts text from one script to another so that models see greater lexical overlap across languages and therefore transfer knowledge more effectively. Different ways of adding transliteration at input time have appeared over time, each carrying its own accuracy, efficiency, and coverage trade-offs that vary with the target languages and tasks.
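
A minimal sketch of the mechanism behind this claim, assuming the third-party unidecode package as a stand-in transliterator (a real pipeline would use a language-aware scheme such as ISO 15919); the example sentences and the n-gram overlap metric are illustrative, not from the paper:

```python
# Minimal sketch: how moving text into a shared script can raise surface
# lexical overlap. `unidecode` is a crude stand-in for a real transliterator.
from unidecode import unidecode

def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams as a rough proxy for shared lexical units."""
    text = text.replace(" ", "_")
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

hindi = "मेरा नाम राज है"        # Devanagari script
latin = "mera nam raj hai"       # the same sentence in Latin script

before = jaccard(char_ngrams(hindi), char_ngrams(latin))
after = jaccard(char_ngrams(unidecode(hindi)), char_ngrams(latin))

print(f"overlap before transliteration: {before:.2f}")  # 0.00: disjoint scripts
print(f"overlap after transliteration:  {after:.2f}")   # > 0 once scripts match
```

Models that share subword vocabularies across languages can only exploit overlap that is visible at the surface, which is exactly what the script conversion buys.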

What carries the argument

A taxonomy of motivations for applying transliteration, paired with the set of input-incorporation approaches; together these organize the observed benefits across code-mixing, language-family relatedness, and inference efficiency.
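
To make that machinery concrete, here is a toy rendering of the two taxonomies as a lookup table; the motivation names are taken from this review's summary, while the goal/approach pairings are illustrative rather than the survey's literal structure:

```python
# Toy rendering of the survey's two axes: motivations for transliteration
# mapped to candidate input-incorporation approaches. Pairings are
# illustrative; the paper's actual taxonomy may differ.
TAXONOMY: dict[str, dict[str, str]] = {
    "code_mixing": {
        "goal": "align mixed scripts into one representation",
        "approach": "normalize all input to a single script before tokenization",
    },
    "family_relatedness": {
        "goal": "raise lexical overlap between related languages",
        "approach": "romanize both languages to expose shared subwords",
    },
    "inference_efficiency": {
        "goal": "shorten token sequences for non-Latin scripts",
        "approach": "feed romanized text to a Latin-heavy tokenizer",
    },
}

def recommend(setting: str) -> str:
    """Toy selection rule in the spirit of the survey's recommendations."""
    entry = TAXONOMY.get(setting)
    return entry["approach"] if entry else "keep native script"

print(recommend("family_relatedness"))
```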

If this is right

  • For code-mixed text, transliteration improves model handling by aligning mixed scripts into one representation.
  • Languages from the same family gain more from transfer when transliteration increases shared vocabulary.
  • Inference-time speed improves when transliteration into a tokenizer-friendly script shortens input token sequences in large models (see the sketch after this list).
  • Researchers obtain concrete selection rules that match transliteration strategy to language, task, and compute limits.
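
The efficiency bullet is directly checkable. A hedged sketch, assuming the Hugging Face transformers library and mBERT's tokenizer; the model choice and sentence are illustrative, not drawn from the survey:

```python
# Sketch of the efficiency check: does romanized input tokenize into fewer
# subwords under a vocabulary that is richer in Latin script? Assumes
# `transformers` and `unidecode` are installed; model and sentence are
# illustrative choices, not the survey's setup.
from transformers import AutoTokenizer
from unidecode import unidecode

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

native = "यह एक लंबा हिंदी वाक्य है जो कई टोकन में विभाजित होगा"
romanized = unidecode(native)

native_ids = tokenizer.encode(native, add_special_tokens=False)
roman_ids = tokenizer.encode(romanized, add_special_tokens=False)

# Fewer tokens means fewer forward-pass positions (and fewer decode steps in
# generation); whether romanization wins depends on how well the tokenizer's
# vocabulary covers the native script.
print(f"native script: {len(native_ids)} tokens")
print(f"romanized:     {len(roman_ids)} tokens")
```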

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trade-off analysis could be reapplied to test whether transliteration still helps once models reach the scale of the newest LLMs released after the survey.
  • Low-resource languages not heavily represented in the reviewed literature could be checked to see if the same taxonomy still predicts useful strategies.
  • Combining transliteration with other cross-lingual signals such as shared subword units might produce larger gains than either method alone.

Load-bearing premise

The taxonomy and the collected studies together cover the main ways transliteration is used today without leaving out important recent work or major language settings.

What would settle it

A controlled experiment on a language pair and task outside the surveyed cases that shows transliteration either reduces accuracy or adds no measurable gain would undermine the selection recommendations.
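
A sketch of how such a settling run could be scored, assuming per-example correctness from two runs of the same model on the same held-out test set, with and without transliterated input; the score arrays below are synthetic placeholders, not results:

```python
# Paired bootstrap over per-example correctness: with vs. without
# transliteration on the same test items. Scores here are synthetic.
import random

def paired_bootstrap(with_translit, without, n_resamples=10_000, seed=0):
    """Fraction of resamples in which transliteration beats the baseline."""
    assert len(with_translit) == len(without)
    rng = random.Random(seed)
    n = len(with_translit)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        wins += sum(with_translit[i] for i in idx) > sum(without[i] for i in idx)
    return wins / n_resamples

rng = random.Random(42)
baseline = [rng.random() < 0.60 for _ in range(200)]   # True = correct
translit = [rng.random() < 0.60 for _ in range(200)]   # null case: no real gain

p_win = paired_bootstrap(translit, baseline)
print(f"P(transliteration > baseline) = {p_win:.3f}")
# Near 0.5 on an unsurveyed language pair: no measurable gain, cutting against
# the selection recommendations; near 1.0: support for them.
```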

Figures

Figures reproduced from arXiv: 2604.18722 by Deepon Halder, Raj Dabre, Thanmay Jayakumar.

Figure 1: Illustration of common transliteration am
Figure 2: A taxonomy of the key motivations as to why transliterated data may be useful.
Figure 3: A taxonomy of the key approaches as to how transliterated data may be integrated.
Original abstract

Cross-lingual transfer in NLP is often hindered by the "script barrier" where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations to utilize transliterations in language models, and provide an overview of different approaches of incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discussing the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings that show how transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper surveys the role of transliteration in NLP for overcoming the script barrier in cross-lingual transfer. It introduces a taxonomy of motivations for using transliteration in language models, overviews methods for incorporating transliterations as input, analyzes the evolution, effectiveness, and trade-offs of these approaches, and situates them within modern LLMs. The review highlights benefits in code-mixed text handling, leveraging language family relatedness, and inference efficiency gains, and concludes with concrete recommendations for selecting transliteration strategies based on language, task, and resource constraints.

Significance. If the taxonomy and literature synthesis prove comprehensive, the survey offers a useful organizing framework for researchers in multilingual NLP and LLMs by consolidating motivations, approaches, and practical trade-offs. It explicitly credits prior work through structured analysis and provides actionable recommendations that could aid strategy selection in resource-constrained settings, particularly where efficiency matters. The emphasis on pragmatic gains in modern LLMs is a timely contribution if recent developments are adequately covered.

major comments (1)
  1. [Taxonomy and modern LLMs contextualization] Taxonomy section and the modern LLMs contextualization (as referenced in the abstract): The central claim that transliteration yields benefits in code-mixing, language-family transfer, and inference efficiency, leading to concrete strategy recommendations, depends on the taxonomy comprehensively capturing current practice. If post-2022 literature on transliteration interactions with subword/byte-level tokenizers (e.g., in mT5, Llama, or Mistral-style models) or long-context efficiency is omitted, the generalization to current LLM settings is undermined. This is load-bearing for the recommendations in resource-constrained scenarios.
minor comments (2)
  1. [Abstract] The abstract states the survey contextualizes transliteration in modern LLMs but does not indicate the cutoff date for reviewed literature or list key recent models explicitly; adding this would help readers evaluate coverage.
  2. [Taxonomy] The figure or table presenting the taxonomy of motivations would benefit from a clearer visual distinction between historical and contemporary approaches to aid quick reference.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key area for strengthening the manuscript. We address the major comment below and will incorporate revisions to enhance the coverage of recent literature.

Point-by-point responses
  1. Referee: Taxonomy section and the modern LLMs contextualization (as referenced in the abstract): The central claim that transliteration yields benefits in code-mixing, language-family transfer, and inference efficiency, leading to concrete strategy recommendations, depends on the taxonomy comprehensively capturing current practice. If post-2022 literature on transliteration interactions with subword/byte-level tokenizers (e.g., in mT5, Llama, or Mistral-style models) or long-context efficiency is omitted, the generalization to current LLM settings is undermined. This is load-bearing for the recommendations in resource-constrained scenarios.

    Authors: We agree that robust coverage of post-2022 developments is necessary to support the recommendations for modern LLM settings. The taxonomy organizes motivations (e.g., lexical overlap for code-mixing, family relatedness, and efficiency) that are largely architecture-agnostic, and the manuscript already reviews the evolution of incorporation methods through transformer-based models while contextualizing needs in LLMs. However, to directly address the concern, we will expand the modern LLMs section with additional analysis and citations of recent works examining transliteration interactions with subword tokenizers (as in Llama and Mistral) and byte-level approaches, including any documented effects on long-context efficiency. This revision will make the generalization and practical recommendations more explicit and evidence-based without altering the core taxonomy. revision: yes

Circularity Check

0 steps flagged

No circularity: survey aggregates external studies

Full rationale

This is a literature survey that presents a taxonomy of motivations, overviews approaches from cited works, analyzes trade-offs, and offers recommendations based on external evidence. No equations, fitted parameters, or self-derived predictions exist. Central claims about benefits in code-mixing, language-family transfer, and efficiency rest on reviewed literature rather than reducing to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. No free parameters, axioms, or invented entities are introduced; the content rests on synthesis of previously published NLP studies.

pith-pipeline@v0.9.0 · 5467 in / 1108 out tokens · 48832 ms · 2026-05-10T05:04:21.455853+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 52 canonical work pages

  1. [1]

    Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages

    Zhuang, Wenhao and Sun, Yuan and Zhao, Xiaobing. Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.795

  2. [2]

    NICT's Participation in WAT 2018: Approaches Using Multilingualism and Recurrently Stacked Layers

    Dabre, Raj and Kunchukuttan, Anoop and Fujita, Atsushi and Sumita, Eiichiro. NICT's Participation in WAT 2018: Approaches Using Multilingualism and Recurrently Stacked Layers. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation. 2018

  3. [3]

    Robust neural machine translation with joint textual and phonetic embedding

    Liu, Hairong and Ma, Mingbo and Huang, Liang and Xiong, Hao and He, Zhongjun. Robust Neural Machine Translation with Joint Textual and Phonetic Embedding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1291

  4. [4]

    Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages

    Nakov, Preslav and Ng, Hwee Tou. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009

  5. [5]

    RomanLens: The Role of Latent Romanization in Multilinguality in LLMs

    Saji, Alan and Husain, Jaavid Aktar and Jayakumar, Thanmay and Dabre, Raj and Kunchukuttan, Anoop and Puduppully, Ratish. RomanLens: The Role of Latent Romanization in Multilinguality in LLMs. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1354

  6. [6]

    Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings

    Wen-Yi, Andrea W and Mimno, David. Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.71

  7. [7]

    How Multilingual is Multilingual BERT?

    Pires, Telmo and Schlinger, Eva and Garrette, Dan. How Multilingual is Multilingual BERT?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1493

  8. [8]

    When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

    Muller, Benjamin and Anastasopoulos, Antonios and Sagot, Benoît and Seddah, Djamé. When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10...

  9. [9]

    Pushing the Limits of Low-Resource Morphological Inflection

    Anastasopoulos, Antonios and Neubig, Graham. Pushing the Limits of Low-Resource Morphological Inflection. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1091

  10. [10]

    A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation

    van der Wees, Marlies and Bisazza, Arianna and Monz, Christof. A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation. Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). 2016

  11. [11]

    Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration

    Lin, Ying and Pan, Xiaoman and Deri, Aliya and Ji, Heng and Knight, Kevin. Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration. Proceedings of the Sixth Named Entity Workshop. 2016. doi:10.18653/v1/W16-2701

  12. [12]

    Integrating an Unsupervised Transliteration Model into Statistical Machine Translation

    Durrani, Nadir and Sajjad, Hassan and Hoang, Hieu and Koehn, Philipp. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers. 2014. doi:10.3115/v1/E14-4029

  13. [13]

    Improving machine translation via triangulation and transliteration

    Durrani, Nadir and Koehn, Philipp. Improving machine translation via triangulation and transliteration. Proceedings of the 17th Annual Conference of the European Association for Machine Translation. 2014

  14. [14]

    Romanization-based Approach to Morphological Analysis in Korean SMS Text Processing

    Kim, Youngsam and Shin, Hyopil. Romanization-based Approach to Morphological Analysis in Korean SMS Text Processing. Proceedings of the Sixth International Joint Conference on Natural Language Processing. 2013

  15. [15]

    HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity

    He, Junqing and Wu, Long and Zhao, Xuemin and Yan, Yonghong. HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017. doi:10.18653/v1/S17-2033

  16. [16]

    Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment

    Semmar, Nasredine and Saadane, Houda. Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment. Proceedings of the Sixth International Joint Conference on Natural Language Processing. 2013

  17. [17]

    How do you pronounce your name? Improving G2P with transliterations

    Bhargava, Aditya and Kondrak, Grzegorz. How do you pronounce your name? Improving G2P with transliterations. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011

  18. [18]

    Robust Deep Learning Based Sentiment Classification of Code-Mixed Text

    Mukherjee, Siddhartha and Prasan, Vinuthkumar and Nediyanchath, Anish and Shah, Manan and Kumar, Nikhil. Robust Deep Learning Based Sentiment Classification of Code-Mixed Text. Proceedings of the 16th International Conference on Natural Language Processing. 2019

  19. [19]

    Integration of an Arabic Transliteration Module into a Statistical Machine Translation System

    Kashani, Mehdi M. and Joanis, Eric and Kuhn, Roland and Foster, George and Popowich, Fred. Integration of an Arabic Transliteration Module into a Statistical Machine Translation System. Proceedings of the Second Workshop on Statistical Machine Translation. 2007

  20. [20]

    Cross-lingual Named Entity List Search via Transliteration

    Khakhmovich, Aleksandr and Pavlova, Svetlana and Kirillova, Kira and Arefyev, Nikolay and Savilova, Ekaterina. Cross-lingual Named Entity List Search via Transliteration. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

  21. [21]

    Arabizi sentiment analysis based on transliteration and automatic corpus annotation

    Guellil, Imane and Adeel, Ahsan and Azouaou, Faical and Benali, Fodil and Hachani, Ala-eddine and Hussain, Amir. Arabizi sentiment analysis based on transliteration and automatic corpus annotation. Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2018. doi:10.18653/v1/W18-6249

  22. [22]

    Towards Offensive Language Identification for Dravidian Languages

    Sai, Siva and Sharma, Yashvardhan. Towards Offensive Language Identification for Dravidian Languages. Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 2021

  23. [23]

    DE-ABUSE@TamilNLP-ACL 2022: Transliteration as Data Augmentation for Abuse Detection in Tamil

    Palanikumar, Vasanth and Benhur, Sean and Hande, Adeep and Chakravarthi, Bharathi Raja. DE-ABUSE@TamilNLP-ACL 2022: Transliteration as Data Augmentation for Abuse Detection in Tamil. Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages. 2022. doi:10.18653/v1/2022.dravidianlangtech-1.5

  24. [24]

    Hate Speech and Offensive Language Detection in Bengali

    Das, Mithun and Banerjee, Somnath and Saha, Punyajoy and Mukherjee, Animesh. Hate Speech and Offensive Language Detection in Bengali. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2022. doi:1...

  25. [25]

    English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation

    Laskar, Sahinur Rahman and Dadure, Pankaj and Manna, Riyanka and Pakray, Partha and Bandyopadhyay, Sivaji. English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation. Proceedings of the 9th Workshop on Asian Translation. 2022

  26. [26]

    DLRG-DravidianLangTech@EACL 2024: Combating Hate Speech in Telugu Code-mixed Text on Social Media

    Rajalakshmi, Ratnavel and M, Saptharishee and S, Hareesh and R, Gabriel and Sr, Varsini. DLRG-DravidianLangTech@EACL 2024: Combating Hate Speech in Telugu Code-mixed Text on Social Media. Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. 2024

  27. [27]

    Sandalphon@DravidianLangTech-EACL 2024: Hate and Offensive Language Detection in Telugu Code-mixed Text using Transliteration-Augmentation

    Tabassum, Nafisa and Khan, Mosabbir and Ahsan, Shawly and Hossain, Jawad and Hoque, Mohammed Moshiul. Sandalphon@DravidianLangTech-EACL 2024: Hate and Offensive Language Detection in Telugu Code-mixed Text using Transliteration-Augmentation. Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages. 2024

  28. [28]

    Romanization-based Large-scale Adaptation of Multilingual Language Models

    Purkayastha, Sukannya and Ruder, Sebastian and Pfeiffer, Jonas and Gurevych, Iryna and Vulić, Ivan. Romanization-based Large-scale Adaptation of Multilingual Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.538

  29. [29]

    RoMantra: Optimizing Neural Machine Translation for Low-Resource Languages through Romanization

    Soni, Govind and Bhattacharyya, Pushpak. RoMantra: Optimizing Neural Machine Translation for Low-Resource Languages through Romanization. Proceedings of the 21st International Conference on Natural Language Processing (ICON). 2024

  30. [30]

    RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

    Husain, Jaavid and Dabre, Raj and M, Aswanth and Gala, Jay and Jayakumar, Thanmay and Puduppully, Ratish and Kunchukuttan, Anoop. RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. do...

  31. [31]

    Jailbreaking LLMs with Arabic Transliteration and Arabizi

    Al Ghanim, Mansour and Almohaimeed, Saleh and Zheng, Mengxin and Solihin, Yan and Lou, Qian. Jailbreaking LLMs with Arabic Transliteration and Arabizi. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1034

  32. [32]

    Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

    Ma, Chunlan and Liu, Yihong and Ye, Haotian and Schuetze, Hinrich. Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts. Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025). 2025. doi:10.18653/v1/2025.mrl-main.27

  33. [33]

    On Romanization for Model Transfer Between Scripts in Neural Machine Translation

    Amrhein, Chantal and Sennrich, Rico. On Romanization for Model Transfer Between Scripts in Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.223

  34. [34]

    Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

    Nag, Arijit and Mukherjee, Animesh and Ganguly, Niloy and Chakrabarti, Soumen. Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.920

  35. [35]
  36. [36]

    Unsupervised Machine Translation On Dravidian Languages

    Koneru, Sai and Liu, Danni and Niehues, Jan. Unsupervised Machine Translation On Dravidian Languages. Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 2021

  37. [37]

    Unsupervised Cross-lingual Representation Learning at Scale

    Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzmán, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...

  38. [38]

    Pinyin as subword unit for Chinese-sourced neural machine translation

    Pinyin as subword unit for Chinese-sourced neural machine translation. 2017

  39. [39]

    XMU Neural Machine Translation Systems for WAT 2018 Myanmar-English Translation Task

    Wang, Boli and Hu, Jinming and Chen, Yidong and Shi, Xiaodong. XMU Neural Machine Translation Systems for WAT 2018 Myanmar-English Translation Task. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation. 2018

  40. [40]

    Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation

    Aqlan, Fares and Fan, Xiaoping and Alqwbani, Abdullah and Al-Mansoub, Akram. Arabic–Chinese Neural Machine Translation: Romanized Arabic as Subword Unit for Arabic-sourced Translation

  41. [41]

    A universal parent model for low-resource neural machine translation transfer

    A universal parent model for low-resource neural machine translation transfer. arXiv preprint arXiv:1909.06516

  42. [42]

    Name Translation based on Fine-grained Named Entity Recognition in a Single Language

    Sadamitsu, Kugatsu and Saito, Itsumi and Katayama, Taichi and Asano, Hisako and Matsuo, Yoshihiro. Name Translation based on Fine-grained Named Entity Recognition in a Single Language. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016

  43. [43]

    Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

    Doddapaneni, Sumanth and Aralikatte, Rahul and Ramesh, Gowtham and Goyal, Shreya and Khapra, Mitesh M. and Kunchukuttan, Anoop and Kumar, Pratyush. Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volu...

  44. [44]

    IndicBART: A Pre-trained Model for Indic Natural Language Generation

    Dabre, Raj and Shrotriya, Himani and Kunchukuttan, Anoop and Puduppully, Ratish and Khapra, Mitesh and Kumar, Pratyush. IndicBART: A Pre-trained Model for Indic Natural Language Generation. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.145

  45. [45]

    Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

    Khemchandani, Yash and Mehtani, Sarvesh and Patil, Vaidehi and Awasthi, Abhijeet and Talukdar, Partha and Sarawagi, Sunita. Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conf...

  46. [46]

    Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages

    Goyal, Vikrant and Kumar, Sourav and Sharma, Dipti Misra. Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2020. doi:10.18653/v1/2020.acl-srw.22

  47. [47]

    Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

    Dhamecha, Tejas and Murthy, Rudra and Bharadwaj, Samarth and Sankaranarayanan, Karthik and Bhattacharyya, Pushpak. Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021....

  48. [48]

    Does Transliteration Help Multilingual Language Modeling?

    Moosa, Ibraheem Muhammad and Akhter, Mahmud Elahi and Habib, Ashfia Binte. Does Transliteration Help Multilingual Language Modeling?. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.50

  49. [49]

    The University of Maryland's Kazakh-English Neural Machine Translation System at WMT19

    Briakou, Eleftheria and Carpuat, Marine. The University of Maryland's Kazakh-English Neural Machine Translation System at WMT19. Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). 2019. doi:10.18653/v1/W19-5308

  50. [50]

    Pre-training via Leveraging Assisting Languages for Neural Machine Translation

    Song, Haiyue and Dabre, Raj and Mao, Zhuoyuan and Cheng, Fei and Kurohashi, Sadao and Sumita, Eiichiro. Pre-training via Leveraging Assisting Languages for Neural Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2020. doi:10.18653/v1/2020.acl-srw.37

  51. [51]

    Transliteration for Cross-Lingual Morphological Inflection

    Murikinati, Nikitha and Anastasopoulos, Antonios and Neubig, Graham. Transliteration for Cross-Lingual Morphological Inflection. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. 2020. doi:10.18653/v1/2020.sigmorphon-1.22

  52. [52]

    Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT 2021

    Khatri, Jyotsana and Saini, Nikhil and Bhattacharyya, Pushpak. Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT 2021. Proceedings of the 8th Workshop on Asian Translation (WAT2021). 2021. doi:10.18653/v1/2021.wat-1.26

  53. [53]

    A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

    Vania, Clara and Kementchedjhieva, Yova and Søgaard, Anders and Lopez, Adam. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

  54. [54]

    Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect

    Micallef, Kurt and Eryani, Fadhl and Habash, Nizar and Bouamor, Houda and Borg, Claudia. Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect. Proceedings of the Workshop on Computation and Written Language (CAWL 2023). 2023. doi:10.18653/v1/2023.cawl-1.4

  55. [55]

    MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness

    Zhou, Shijia and Shan, Huangyan and Plank, Barbara and Litschko, Robert. MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness. Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024). 2024. doi:10.18653/v1/2024.semeval-1.259

  56. [56]

    Cross-lingual Transfer Learning for Japanese Named Entity Recognition

    Johnson, Andrew and Karanasou, Penny and Gaspers, Judith and Klakow, Dietrich. Cross-lingual Transfer Learning for Japanese Named Entity Recognition. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers). 2019. doi:10.18653/v1/N19-2023

  57. [57]

    Rijhwani, Shruti and Xie, Jiateng and Neubig, Graham and Carbonell, Jaime. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. 2019. doi:10.1609/...

  58. [58]

    A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding

    Roychoudhury, Rajarshi and Dey, Subhrajit and Akhtar, Md and Das, Amitava and Naskar, Sudip. A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding. Proceedings of the 19th International Conference on Natural Language Processing (ICON). 2022

  59. [59]

    Multiple Character Embeddings for Chinese Word Segmentation

    Zhou, Jianing and Wang, Jingkang and Liu, Gongshen. Multiple Character Embeddings for Chinese Word Segmentation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 2019. doi:10.18653/v1/P19-2029

  60. [60]

    Putting Figures on Influences on Moroccan Darija from Arabic, French and Spanish using the WordNet

    Mrini, Khalil and Bond, Francis. Putting Figures on Influences on Moroccan Darija from Arabic, French and Spanish using the WordNet. Proceedings of the 9th Global Wordnet Conference. 2018

  61. [61]

    Specializing Multilingual Language Models: An Empirical Study

    Chau, Ethan C. and Smith, Noah A. Specializing Multilingual Language Models: An Empirical Study. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.5

  62. [62]

    Alternative Input Signals Ease Transfer in Multilingual Machine Translation

    Sun, Simeng and Fan, Angela and Cross, James and Chaudhary, Vishrav and Tran, Chau and Koehn, Philipp and Guzmán, Francisco. Alternative Input Signals Ease Transfer in Multilingual Machine Translation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.363

  63. [63]

    TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

    Liu, Yihong and Ma, Chunlan and Ye, Haotian and Schuetze, Hinrich. TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.136

  64. [64]

    How transliterations improve crosslingual alignment

    How transliterations improve crosslingual alignment. arXiv preprint arXiv:2409.17326

  65. [65]

    Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

    Xhelili, Orgest and Liu, Yihong and Schuetze, Hinrich. Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.659

  66. [66]

    The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish

    Salehi, Ali and Jacobs, Cassandra L. The Effect of Model Capacity and Script Diversity on Subword Tokenization for Sorani Kurdish. Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology. 2024. doi:10.18653/v1/2024.sigmorphon-1.6

  67. [67]

    ScriptMix: Mixing Scripts for Low-resource Language Parsing

    Lee, Jaeseong and Lee, Dohyeon and Hwang, Seung-won. ScriptMix: Mixing Scripts for Low-resource Language Parsing. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.357

  68. [68]

    TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

    Liu, Yihong and Ma, Chunlan and Ye, Haotian and Schuetze, Hinrich. TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  69. [69]

    Input Combination Strategies for Multi-Source Transformer Decoder

    Libovický, Jindřich and Helcl, Jindřich and Mareček, David. Input Combination Strategies for Multi-Source Transformer Decoder. Proceedings of the Third Conference on Machine Translation: Research Papers. 2018. doi:10.18653/v1/W18-6326

  70. [70]

    Emerging Cross-lingual Structure in Pretrained Language Models

    Conneau, Alexis and Wu, Shijie and Li, Haoran and Zettlemoyer, Luke and Stoyanov, Veselin. Emerging Cross-lingual Structure in Pretrained Language Models. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.536

  71. [71]

    Lost in Transliteration: Bridging the Script Gap in Neural IR

    Lost in Transliteration: Bridging the Script Gap in Neural IR. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

  72. [72]

    Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

    Johnson, Melvin and Schuster, Mike and Le, Quoc V. and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Viégas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Co...

  73. [73]

    Transfer Learning for Low-Resource Neural Machine Translation

    Zoph, Barret and Yuret, Deniz and May, Jonathan and Knight, Kevin. Transfer Learning for Low-Resource Neural Machine Translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1163

  74. [74]

    Cross-lingual Language Model Pretraining

    Conneau, Alexis and Lample, Guillaume. Cross-lingual Language Model Pretraining

  75. [75]

    Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

    Jung, Haeji and Kim, Jinju and Kim, Kyungjin and Roh, Youjeong and Mortensen, David R. Happiness is Sharing a Vocabulary: A Study of Transliteration Methods. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.365

  76. [76]

    Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review

    Philippy, Fred and Guo, Siwen and Haddadan, Shohreh. Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.323

  77. [77]

    The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

    Winata, Genta and Aji, Alham Fikri and Yong, Zheng Xin and Solorio, Thamar. The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.185

  78. [78]

    Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

    Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities. 2026