
arxiv: 2605.22099 · v1 · pith:HBGJBHJBnew · submitted 2026-05-21 · 💻 cs.CL

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords Khmer · Retrieval-Augmented Generation · RAG · dense retrieval · question answering · language models · low-resource languages · telecom domain

The pith

BGE-M3 retrieves Khmer documents most effectively in RAG while no generator model leads on every quality metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates retrieval-augmented generation for question answering over Khmer-language telecom documents. It first compares three embedding models on dense retrieval and identifies BGE-M3 as the strongest performer across hit rate, file hit rate, MRR, and precision at top-3. It then fixes BGE-M3 as retriever and tests five generator models on a set of 200 curated question-answer pairs using six RAGAS-style metrics. Different generators prove strongest on different axes, with one leading faithfulness and context relevance, another factual correctness, and a third answer relevance, similarity, and correctness. A reader would care because the results identify concrete component choices that affect how reliably a RAG system can ground answers in retrieved evidence for a low-resource, non-Latin-script language.

Core claim

BGE-M3 achieves the highest retrieval scores with Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112 over Khmer documents. When this retriever is paired with each of five generators, Qwen3.5-9B scores highest on faithfulness and context relevance, Qwen3-8B on factual correctness, and SeaLLMs-v3-7B-Chat on answer relevance, answer similarity, and answer correctness.
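The four retrieval scores above can be illustrated with a minimal sketch. The definitions below are the conventional ones for top-k retrieval; the page does not spell out the paper's exact formulas, so treat them as assumptions. The chunk-id convention (`"file#chunk"`) and all function names are hypothetical.

```python
from typing import Sequence, Set


def hit_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """1.0 if any of the top-k retrieved chunk ids is relevant, else 0.0."""
    return 1.0 if any(cid in relevant for cid in ranked_ids[:k]) else 0.0


def file_hit_at_k(ranked_ids: Sequence[str], relevant_files: Set[str], k: int = 3) -> float:
    """1.0 if any top-k chunk comes from a relevant file.

    Assumes the hypothetical id convention "file#chunk".
    """
    files = {cid.split("#", 1)[0] for cid in ranked_ids[:k]}
    return 1.0 if files & relevant_files else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """1/rank of the first relevant chunk within the top k, or 0.0 if none."""
    for rank, cid in enumerate(ranked_ids[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k slots occupied by relevant chunks."""
    return sum(1 for cid in ranked_ids[:k] if cid in relevant) / k


def mean(values: Sequence[float]) -> float:
    """Average a per-question metric over the evaluation set."""
    return sum(values) / len(values)
```

Averaging each per-question value over the evaluation set gives Hit Rate@3, File Hit Rate@3, MRR@3, and Precision@3. A file-level hit is easier to obtain than a chunk-level hit, which is consistent with the reported gap between 0.700 and 0.285.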

What carries the argument

Two-phase benchmarking that first selects the best dense retriever among three embedding models and then measures five generator models against six RAGAS-inspired metrics on a fixed retriever.
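As a rough sketch of this two-phase protocol, the helpers below pick a retriever by one primary metric and then check whether any generator leads on every metric. The score-table layout, the function names, and the idea of a single "primary" retrieval metric are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Optional

# model name -> metric name -> score (hypothetical table layout)
Scores = Dict[str, Dict[str, float]]


def select_retriever(retrieval_scores: Scores, primary_metric: str) -> str:
    """Phase 1: return the retriever with the highest score on one metric."""
    return max(retrieval_scores, key=lambda m: retrieval_scores[m][primary_metric])


def leaders_per_metric(generator_scores: Scores) -> Dict[str, str]:
    """Phase 2: map each metric to the generator that scores highest on it."""
    metrics = next(iter(generator_scores.values())).keys()
    return {
        metric: max(generator_scores, key=lambda m: generator_scores[m][metric])
        for metric in metrics
    }


def dominant_model(generator_scores: Scores) -> Optional[str]:
    """Return the generator that leads every metric, or None if leaders differ."""
    leaders = set(leaders_per_metric(generator_scores).values())
    return leaders.pop() if len(leaders) == 1 else None
```

Under this framing, the paper's finding that no generator dominates corresponds to `dominant_model` returning `None` on the measured score table.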

If this is right

  • Retriever choice remains a major bottleneck for Khmer RAG systems: even the best retriever tested reaches a Hit Rate@3 of only 0.285.
  • Generator selection should be driven by which metric matters most for a given use case.
  • Multilingual embedding models trained across many languages transfer better to Khmer retrieval than the alternatives tested.
  • Practical deployments will need to trade off among faithfulness, factual correctness, and semantic similarity rather than expect one model to optimize all three.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar side-by-side tests could identify reliable component combinations for other low-resource languages that use non-Latin scripts.
  • Hybrid systems that route queries to the generator strongest on the needed metric might improve end-to-end results without retraining.
  • The observed variation across metrics suggests that automatic metric weighting or learned reranking could be a useful next step.
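The routing idea in the second bullet could look roughly like the sketch below. The leader table restates the per-metric winners reported in the abstract; the `route` function and the notion of a caller-supplied priority per query are hypothetical, and nothing like this is evaluated in the paper.

```python
from typing import Dict

# Per-metric leaders as reported in the paper's abstract.
REPORTED_LEADERS: Dict[str, str] = {
    "faithfulness": "Qwen3.5-9B",
    "context_relevance": "Qwen3.5-9B",
    "factual_correctness": "Qwen3-8B",
    "answer_relevance": "SeaLLMs-v3-7B-Chat",
    "answer_similarity": "SeaLLMs-v3-7B-Chat",
    "answer_correctness": "SeaLLMs-v3-7B-Chat",
}


def route(priority: str, leaders: Dict[str, str] = REPORTED_LEADERS) -> str:
    """Return the generator that led on the metric the caller cares about most.

    `priority` is a hypothetical per-query label supplied by the application.
    """
    if priority not in leaders:
        raise ValueError(f"unknown priority metric: {priority!r}")
    return leaders[priority]
```

For instance, a query where grounding matters most would call `route("faithfulness")`. Whether routing actually improves end-to-end quality would still need to be measured.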

Load-bearing premise

The curated golden dataset of 200 Khmer question-answer pairs is representative and free of selection bias for evaluating real-world performance in the telecom domain.

What would settle it

Repeating the full evaluation on an independently collected set of several hundred new Khmer questions drawn from actual telecom support logs or forums would show whether the same model rankings hold.
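One way to quantify whether "the same model rankings hold" is a rank correlation between the two evaluations. The sketch below computes Spearman's rho from two score tables for a single metric. It assumes no tied scores, and the function names and input layout are illustrative rather than anything the paper prescribes.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, 1 = best. Assumes there are no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman_rho(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Spearman rank correlation between two evaluations of the same models."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both evaluations must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    squared_diff = sum((rank_a[m] - rank_b[m]) ** 2 for m in scores_a)
    return 1.0 - 6.0 * squared_diff / (n * (n * n - 1))
```

A rho near 1 on each metric would suggest the rankings transfer to the new question set; a value near 0 or below would suggest they depend on the original 200 pairs. With only five generators, the estimate is coarse.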

Figures

Figures reproduced from arXiv: 2605.22099 by Kimleang Ly, Phannet Pov, Ratanaktepi Chhor, Saksonita Khoeurn, Sereiwathna Ros, Wan-Sup Cho.

Figure 1: System architecture of the RAG pipeline.
Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
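Two of the six metrics reduce to simple counting once a judge model has labeled individual statements. The sketch below follows the usual RAGAS-style formulation: faithfulness as the share of answer statements supported by the retrieved context, and factual correctness as an F1 score over claims. The abstract does not specify the paper's exact implementation, so this is an assumption, and the function names are hypothetical.

```python
def faithfulness(supported: int, total: int) -> float:
    """Share of answer statements the judge found supported by the context."""
    if total <= 0:
        raise ValueError("the answer must contain at least one statement")
    return supported / total


def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims.

    tp: answer claims that match the reference
    fp: answer claims not found in the reference
    fn: reference claims missing from the answer
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, an answer with 4 supported statements out of 4 scores a faithfulness of 1.0, while 1 true positive, 2 false positives, and 0 false negatives give precision 1/3, recall 1, and an F1 of 0.5.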

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a two-phase empirical evaluation of RAG for Khmer telecom-domain QA. It benchmarks three dense retrievers (BGE-M3, Jina-Embeddings-v3, Qwen3-Embedding) over Khmer documents and finds BGE-M3 strongest (Hit Rate@3 = 0.285). Using BGE-M3 as retriever, it then compares five generators (Qwen3-8B, Qwen3.5-9B, Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, Llama-SEA-LION-v2-8B-IT) on a curated set of 200 Khmer QA pairs using six RAGAS-inspired metrics, reporting that no single generator dominates.

Significance. If the empirical results hold, the work supplies one of the first systematic comparisons of modern embedding and generation models for RAG in Khmer, a low-resource non-Latin-script language. The observation that retrieval performance remains the primary bottleneck while generator strengths are metric-dependent is actionable for practitioners working on similar Southeast-Asian languages. The use of direct held-out evaluation with standard metrics is a methodological strength.

major comments (1)
  1. [Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.
minor comments (2)
  1. [Abstract] The abstract states that six 'RAGAS-inspired' metrics are used but does not specify which exact RAGAS implementations or any Khmer-specific adaptations (e.g., tokenization or embedding choices for faithfulness scoring) were employed.
  2. Retrieval results would be easier to interpret if the total number of documents in the corpus and the average document length were reported alongside the Hit Rate and MRR figures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding dataset documentation below and will incorporate the necessary revisions.

  1. Referee: [Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.

    Authors: We agree that the manuscript would benefit from expanded documentation of the 200 QA pairs to better support the comparative claims and address potential concerns about selection bias. In the revised version, we will add a dedicated subsection detailing the dataset construction process, including the sourcing of questions from authentic Khmer telecom-domain sources, the generation protocol, expert validation procedures, topic coverage across key telecom areas, linguistic diversity considerations, and the distribution of question difficulty levels. This will improve transparency and help readers assess the generalizability of the findings to other low-resource Southeast Asian languages.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmarks


The paper reports direct empirical measurements of retrieval metrics (Hit Rate@3, MRR@3, etc.) for three embedding models and six RAGAS-inspired metrics for five generators on a fixed set of 200 curated Khmer QA pairs. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. Central claims rest on straightforward benchmark results rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on pre-trained models and standard RAG evaluation practices from prior work. It introduces no new free parameters or invented entities; its one domain assumption is listed below.

axioms (1)
  • Domain assumption: Standard RAGAS-style metrics faithfully capture faithfulness, relevance, and correctness for Khmer text.
    Invoked when reporting the six metrics as quantifiers of system performance.



