pith. sign in

arxiv: 2507.08480 · v2 · pith:7WAOPZ7Onew · submitted 2025-07-11 · 💻 cs.IR

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

Pith reviewed 2026-05-22 00:35 UTC · model grok-4.3

classification 💻 cs.IR
keywords cross-lingual information retrievalKorean-English retrievalmodel mergingtraining data compositionmonolingual IRCLIR performancelanguage pairsinformation retrieval
0
0 comments X

The pith

Training data language composition creates a trade-off between cross-lingual and monolingual retrieval performance that model merging can resolve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how the mix of Korean and English in training data affects both cross-lingual information retrieval and monolingual retrieval. Experiments with parallel datasets across different language combinations reveal that certain mixes improve cross-lingual results while degrading monolingual performance due to inter-lingual correlations. The authors show that merging models trained on separate compositions can deliver strong performance on both tasks at once. This data-centric approach matters for building retrieval systems that handle multilingual queries without losing accuracy on single-language searches.

Core claim

The language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities.

What carries the argument

Model merging applied across retrieval models trained on different language compositions of linguistically parallel Korean-English datasets.

If this is right

  • Specific language pair combinations in training data enhance CLIR performance.
  • Optimizing for those CLIR gains causes measurable declines in monolingual IR.
  • Model merging across differently composed models balances both objectives without major compromise.
  • Linguistic configuration of training data directly shapes outcomes in both CLIR and monolingual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-composition effects may appear in other language pairs and could be addressed with merging strategies.
  • Practitioners might use model merging as a default post-training step when scaling multilingual retrieval systems.
  • Automated selection of language mixes could become a new hyperparameter in retrieval model development.

Load-bearing premise

The constructed linguistically parallel Korean-English datasets are representative enough of real usage patterns that the measured performance differences will hold for other data sources and model architectures.

What would settle it

Retraining the same language compositions on an independent Korean-English corpus or different retrieval model architecture that shows no trade-off between CLIR gains and monolingual losses would falsify the inter-lingual correlation claim.

read the original abstract

With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper constructs linguistically parallel Korean-English datasets and trains retrieval models on various language combinations to study the effects of training data composition on CLIR and mono-lingual IR performance. It reports that specific language mixes improve CLIR while degrading mono-lingual IR due to inter-lingual correlations, and demonstrates that model merging can mitigate this trade-off to achieve strong results on both tasks.

Significance. If the empirical findings hold after addressing potential confounds, the work provides concrete data-centric evidence on how language composition in training data creates performance trade-offs in multilingual IR, along with a practical mitigation via model merging. This could inform training strategies for low-resource cross-lingual retrieval and highlights the value of controlled parallel dataset construction for isolating linguistic factors.

major comments (1)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that language composition drives the observed CLIR gains and mono-IR declines requires that total training instance counts are held constant across conditions. The manuscript describes training on 'various language combinations' of parallel data but does not explicitly confirm matched sample sizes; without this control, the inter-lingual correlations could be confounded by differences in data volume or dilution effects rather than linguistic factors alone.
minor comments (3)
  1. [§3] Clarify the exact construction process for the linguistically parallel Korean-English datasets, including any filtering or alignment steps, to allow replication.
  2. [§5] Report statistical significance or variance across multiple runs for the performance differences attributed to language pairs.
  3. [§6] Add a brief discussion of how the model merging procedure (e.g., specific merging technique and hyperparameters) was selected and whether it generalizes beyond the tested architectures.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for this constructive comment on our experimental controls. We address the concern regarding matched training instance counts below and will update the manuscript to make the relevant details explicit.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that language composition drives the observed CLIR gains and mono-IR declines requires that total training instance counts are held constant across conditions. The manuscript describes training on 'various language combinations' of parallel data but does not explicitly confirm matched sample sizes; without this control, the inter-lingual correlations could be confounded by differences in data volume or dilution effects rather than linguistic factors alone.

    Authors: We appreciate the referee's emphasis on this control. In our experiments, the total number of training instances was held constant across all language composition conditions through stratified subsampling of the parallel Korean-English corpus. For mixed-language conditions, we drew equal numbers of instances from each language to match the aggregate count used in monolingual conditions. We will revise §4 to explicitly describe this sampling procedure and reference it when discussing results in §5, thereby strengthening the isolation of linguistic factors from data-volume effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements from controlled training runs

full rationale

The paper reports results from constructing parallel Korean-English datasets and training retrieval models on different language combinations, then measuring CLIR and mono-IR performance. No equations, fitted parameters, or self-referential definitions appear in the provided text; the inter-lingual correlations and model-merging mitigation are presented as direct experimental outcomes rather than quantities derived from themselves. Potential concerns about holding total training volume constant are methodological (possible confound) but do not reduce any claimed result to an input by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study that relies on standard information-retrieval assumptions rather than introducing new theoretical constructs.

axioms (1)
  • domain assumption Standard IR evaluation protocols and retrieval model training procedures produce reliable performance signals.
    The reported performance differences presuppose that common metrics and training setups are valid for the constructed datasets.

pith-pipeline@v0.9.0 · 5718 in / 1117 out tokens · 50215 ms · 2026-05-22T00:35:49.147899+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 7 internal anchors

  1. [1]

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2023. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. arXiv preprint arXiv:2308.16884

  2. [2]

    Paheli Bhattacharya, Pawan Goyal, and Sudeshna Sarkar. 2016. Query translation for cross-language information retrieval using multilingual word clusters. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), pages 152--162

  3. [3]

    Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21--29. IEEE

  4. [4]

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.137 M 3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation . In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318--2335, Bangkok,...

  5. [5]

    Zhuyun Dai and Jamie Callan. 2020. Context-aware term weighting for first stage passage retrieval. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 1533--1536

  6. [6]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and St \'e phane Clinchant. 2021 a . Splade v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086

  7. [7]

    Thibault Formal, Benjamin Piwowarski, and St \'e phane Clinchant. 2021 b . Splade: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2288--2292

  8. [8]

    Jianfeng Gao, Jian-Yun Nie, Endong Xun, Jian Zhang, Ming Zhou, and Changning Huang. 2001. Improving query translation for cross-language information retrieval using statistical models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 96--104

  9. [9]

    Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983

  10. [10]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2:1

  11. [11]

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346--361

  12. [12]

    Koustava Goswami, Sourav Dutta, Haytham Assem, Theodorus Fransen, and John Philip McCrae. 2021. Cross-lingual sentence embedding using multi-task learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9099--9113

  13. [13]

    Zhiqi Huang, Puxuan Yu, and James Allan. 2023. Improving cross-lingual information retrieval on low-resource languages via optimal transport distillation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 1048--1056

  14. [14]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769--6781

  15. [15]

    Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39--48

  16. [16]

    Dawn Lawrie, Eugene Yang, Douglas W Oard, and James Mayfield. 2023. Neural approaches to multilingual information retrieval. In European Conference on Information Retrieval, pages 521--536. Springer

  17. [17]

    Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hern \'a ndez \'A brego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, and 1 others. 2025. Gemini embedding: Generalizable embeddings from gemini. arXiv preprint arXiv:2503.07891

  18. [18]

    u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, and 1 others. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459--9474

  19. [19]

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. 2025 a . Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366

  20. [20]

    Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. 2025 b . From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems, 43(3):1--62

  21. [21]

    Robert Litschko, Ivan Vuli \'c , Simone Paolo Ponzetto, and Goran Glava s . 2022. On cross-lingual retrieval with multilingual text encoders. Information Retrieval Journal, 25(2):149--183

  22. [22]

    Qi Liu and Jiaxin Mao. 2023. Understanding the multi-vector dense retrieval models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4110--4114

  23. [23]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  24. [24]

    Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703--17716

  25. [25]

    Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. 2024. Nv-retriever: Improving text embedding models with effective hard-negative mining. arXiv preprint arXiv:2407.15831

  26. [26]

    Toan Ngoc Nguyen, Nam Le Hai, Nguyen Doan Hieu, Dai An Nguyen, Linh Ngo Van, Thien Huu Nguyen, and Sang Dinh. 2025. Improving vietnamese-english cross-lingual retrieval for legal and general domains. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

  27. [27]

    Dong-Geun Oh. 2020. Korean decimal classification (kdc). ISKO Encyclopedia of Knowledge Organization

  28. [28]

    OpenAI . 2024. Openai api. https://platform.openai.com. Accessed: 2024-06-30

  29. [29]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084

  30. [30]

    Shadi Saleh and Pavel Pecina. 2020. Document translation vs. query translation for cross-lingual information retrieval in the medical domain. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6849--6860

  31. [31]

    Mona L Scott and MONA L SCOTT. 1998. Dewey decimal classification. Libraries Unlimited

  32. [32]

    Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. 2025. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136

  33. [33]

    Amit Singhal and 1 others. 2001. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35--43

  34. [34]

    Aivin V Solatorio. 2024. Gistembed: Guided in-sample selection of training negatives for text embedding fine-tuning. arXiv preprint arXiv:2402.16829

  35. [35]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368

  36. [36]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672

  37. [37]

    Yau-Shian Wang, Ashley Wu, and Graham Neubig. 2022. English contrastive learning can learn universal cross-lingual sentence embeddings. arXiv preprint arXiv:2211.06127

  38. [38]

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of ndcg type ranking measures. In Conference on learning theory, pages 25--54. PMLR

  39. [39]

    Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and 1 others. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning, pages ...

  40. [40]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808

  41. [41]

    Eugene Yang, Thomas J \"a nich, James Mayfield, and Dawn Lawrie. 2024. Language fairness in multilingual information retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2487--2491

  42. [42]

    Puxuan Yu, Hongliang Fei, and Ping Li. 2021. Cross-lingual language model pretraining for retrieval. In Proceedings of the Web Conference 2021, pages 1029--1039

  43. [43]

    Fuwei Zhang, Zhao Zhang, Xiang Ao, Dehong Gao, Fuzhen Zhuang, Yi Wei, and Qing He. 2022. Mind the gap: Cross-lingual information retrieval with hierarchical knowledge enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4345--4353

  44. [44]

    Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, and 1 others. 2024. mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. arXiv preprint arXiv:2407.19669

  45. [45]

    Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. 2024. Dense text retrieval based on pretrained language models: A survey. ACM Transactions on Information Systems, 42(4):1--60

  46. [46]

    Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107

  47. [47]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  48. [48]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...