Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
Pith reviewed 2026-05-22 00:35 UTC · model grok-4.3
The pith
Training data language composition creates a trade-off between cross-lingual and monolingual retrieval performance that model merging can resolve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities.
What carries the argument
Model merging applied across retrieval models trained on different language compositions of linguistically parallel Korean-English datasets.
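A minimal sketch of what such merging could look like, assuming plain linear interpolation of parameters between checkpoints that share one architecture. The merging technique and coefficients the paper actually uses are not stated on this page, so `merge_state_dicts`, the toy checkpoints, and the 0.5/0.5 weights are illustrative assumptions. Plain Python lists stand in for real tensors.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_state_dicts(
    state_dicts: Sequence[StateDict],
    coefficients: Sequence[float],
) -> StateDict:
    """Linearly interpolate the parameters of models with identical layouts."""
    if not state_dicts:
        raise ValueError("need at least one model to merge")
    if len(state_dicts) != len(coefficients):
        raise ValueError("one coefficient per model is required")
    total = sum(coefficients)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    weights = [c / total for c in coefficients]  # normalize so weights sum to 1

    reference_keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != reference_keys:
            raise ValueError("all models must expose the same parameter names")

    merged: StateDict = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across the input models.
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors)) for i in range(size)
        ]
    return merged


# Hypothetical toy checkpoints: one tuned on a cross-lingual mix, one monolingual.
clir_model = {"encoder.weight": [1.0, 3.0], "encoder.bias": [0.0]}
mono_model = {"encoder.weight": [3.0, 1.0], "encoder.bias": [2.0]}
merged = merge_state_dicts([clir_model, mono_model], [0.5, 0.5])
```

In a sketch like this, the coefficients are the knob that would trade CLIR strength against monolingual strength; whether a given setting balances the two is an empirical question.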
If this is right
- Specific language pair combinations in training data enhance CLIR performance.
- Optimizing for those CLIR gains causes measurable declines in monolingual IR.
- Model merging across differently composed models balances both objectives without major compromise.
- Linguistic configuration of training data directly shapes outcomes in both CLIR and monolingual tasks.
Where Pith is reading between the lines
- The same data-composition effects may appear in other language pairs and could be addressed with merging strategies.
- Practitioners might use model merging as a default post-training step when scaling multilingual retrieval systems.
- Automated selection of language mixes could become a new hyperparameter in retrieval model development.
Load-bearing premise
The constructed linguistically parallel Korean-English datasets are representative enough of real usage patterns that the measured performance differences will hold for other data sources and model architectures.
What would settle it
If retraining with the same language compositions on an independent Korean-English corpus, or with a different retrieval model architecture, showed no trade-off between CLIR gains and monolingual losses, the inter-lingual correlation claim would be falsified.
Original abstract
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs linguistically parallel Korean-English datasets and trains retrieval models on various language combinations to study the effects of training data composition on CLIR and mono-lingual IR performance. It reports an inter-lingual correlation in which specific language mixes improve CLIR while degrading mono-lingual IR, and demonstrates that model merging can mitigate this trade-off to achieve strong results on both tasks.
Significance. If the empirical findings hold after addressing potential confounds, the work provides concrete data-centric evidence on how language composition in training data creates performance trade-offs in multilingual IR, along with a practical mitigation via model merging. This could inform training strategies for low-resource cross-lingual retrieval and highlights the value of controlled parallel dataset construction for isolating linguistic factors.
major comments (1)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that language composition drives the observed CLIR gains and mono-IR declines requires that total training instance counts are held constant across conditions. The manuscript describes training on 'various language combinations' of parallel data but does not explicitly confirm matched sample sizes; without this control, the inter-lingual correlations could be confounded by differences in data volume or dilution effects rather than linguistic factors alone.
minor comments (3)
- [§3] Clarify the exact construction process for the linguistically parallel Korean-English datasets, including any filtering or alignment steps, to allow replication.
- [§5] Report statistical significance or variance across multiple runs for the performance differences attributed to language pairs.
- [§6] Add a brief discussion of how the model merging procedure (e.g., specific merging technique and hyperparameters) was selected and whether it generalizes beyond the tested architectures.
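One way the significance check requested in the §5 comment might be approached is a paired bootstrap over per-query scores. The sketch below is an assumption about how a reader could run such a test, not a procedure from the paper; `paired_bootstrap_p` and the toy score lists are hypothetical placeholders, not reported results.

```python
import random
from typing import Sequence


def paired_bootstrap_p(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which A's mean per-query score does not exceed B's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equal-length, non-empty per-query score lists")
    # Pair scores by query so that query difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute the mean difference.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return not_better / n_resamples


# Placeholder per-query metric values for two hypothetical systems.
system_a = [0.62, 0.71, 0.55, 0.80, 0.67]
system_b = [0.60, 0.65, 0.50, 0.78, 0.61]
p_value = paired_bootstrap_p(system_a, system_b)
```

A small returned fraction suggests the gap is unlikely to be resampling noise. Variance across training seeds, which the comment also mentions, would need separate runs and is not covered by this sketch.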
Simulated Author's Rebuttal
We thank the referee for this constructive comment on our experimental controls. We address the concern regarding matched training instance counts below and will update the manuscript to make the relevant details explicit.
Point-by-point responses
- Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that language composition drives the observed CLIR gains and mono-IR declines requires that total training instance counts are held constant across conditions. The manuscript describes training on 'various language combinations' of parallel data but does not explicitly confirm matched sample sizes; without this control, the inter-lingual correlations could be confounded by differences in data volume or dilution effects rather than linguistic factors alone.
  Authors: We appreciate the referee's emphasis on this control. In our experiments, the total number of training instances was held constant across all language composition conditions through stratified subsampling of the parallel Korean-English corpus. For mixed-language conditions, we drew equal numbers of instances from each language to match the aggregate count used in monolingual conditions. We will revise §4 to explicitly describe this sampling procedure and reference it when discussing results in §5, thereby strengthening the isolation of linguistic factors from data-volume effects.
  Revision: yes
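The matched-count sampling described in this response could be sketched roughly as follows. This is an illustration under assumptions: `sample_matched`, the `"ko"`/`"en"` pool keys, and the totals are hypothetical, and the sketch draws equal shares per language without any further stratification the authors may have applied.

```python
import random
from typing import Dict, List, Sequence


def sample_matched(
    pools: Dict[str, Sequence[str]],
    languages: Sequence[str],
    total: int,
    seed: int = 0,
) -> List[str]:
    """Draw `total` instances, split equally across the requested languages."""
    if not languages or total % len(languages) != 0:
        raise ValueError("total must divide evenly across at least one language")
    per_language = total // len(languages)
    rng = random.Random(seed)
    sample: List[str] = []
    for lang in languages:
        pool = list(pools[lang])
        if len(pool) < per_language:
            raise ValueError(f"not enough {lang} instances")
        # Sample without replacement inside each language pool.
        sample.extend(rng.sample(pool, per_language))
    rng.shuffle(sample)  # avoid language-ordered batches
    return sample


# Hypothetical pools standing in for the two sides of a parallel corpus.
pools = {
    "ko": [f"ko-{i}" for i in range(100)],
    "en": [f"en-{i}" for i in range(100)],
}
mono_ko = sample_matched(pools, ["ko"], total=40)
mixed = sample_matched(pools, ["ko", "en"], total=40)
```

Both conditions end up with 40 instances, so any performance gap between them cannot be attributed to training volume alone.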
Circularity Check
No circularity: empirical measurements from controlled training runs
Full rationale
The paper reports results from constructing parallel Korean-English datasets and training retrieval models on different language combinations, then measuring CLIR and mono-IR performance. No equations, fitted parameters, or self-referential definitions appear in the provided text; the inter-lingual correlations and model-merging mitigation are presented as direct experimental outcomes rather than quantities derived from themselves. Potential concerns about holding total training volume constant are methodological (possible confound) but do not reduce any claimed result to an input by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard IR evaluation protocols and retrieval model training procedures produce reliable performance signals.