A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering
Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3
The pith
BGE-M3 retrieves Khmer documents most effectively in RAG while no generator model leads on every quality metric.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BGE-M3 achieves the highest retrieval scores with Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112 over Khmer documents. When this retriever is paired with each of five generators, Qwen3.5-9B scores highest on faithfulness and context relevance, Qwen3-8B on factual correctness, and SeaLLMs-v3-7B-Chat on answer relevance, answer similarity, and answer correctness.
What carries the argument
Two-phase benchmarking: first select the best dense retriever among three embedding models, then score five generator models on six RAGAS-inspired metrics with that retriever held fixed.
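The retrieval metrics named above can be illustrated with a minimal sketch. It assumes the common textbook definitions of Hit Rate@k, MRR@k, and Precision@k; the paper's exact definitions (including how File Hit Rate@3 aggregates chunks to files) are not given here, and the function names and toy chunk IDs are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Return 1.0 if any relevant item appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Return 1/rank of the first relevant item within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Return the fraction of the top k slots filled by relevant items."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def evaluate(queries: List[Tuple[Sequence[str], Set[str]]], k: int = 3) -> Dict[str, float]:
    """Average each per-query metric over all queries."""
    if not queries:
        raise ValueError("need at least one query")
    n = len(queries)
    return {
        f"hit_rate@{k}": sum(hit_at_k(r, g, k) for r, g in queries) / n,
        f"mrr@{k}": sum(reciprocal_rank_at_k(r, g, k) for r, g in queries) / n,
        f"precision@{k}": sum(precision_at_k(r, g, k) for r, g in queries) / n,
    }


if __name__ == "__main__":
    # Toy data (hypothetical chunk IDs): (ranked retrieval output, gold relevant set).
    toy = [
        (["c7", "c2", "c9"], {"c2"}),  # relevant chunk found at rank 2
        (["c1", "c4", "c5"], {"c8"}),  # relevant chunk missed
    ]
    print(evaluate(toy, k=3))
```

Under these definitions, applying the same functions to file identifiers instead of chunk identifiers would give a file-level hit rate, which may be how a File Hit Rate can exceed the chunk-level Hit Rate.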
If this is right
- Retriever choice remains a major performance bottleneck for Khmer RAG systems.
- Generator selection should be driven by which metric matters most for a given use case.
- Multilingual embedding models trained across many languages transfer better to Khmer retrieval than the alternatives tested.
- Practical deployments will need to trade off among faithfulness, factual correctness, and semantic similarity rather than expect one model to optimize all three.
Where Pith is reading between the lines
- Similar side-by-side tests could identify reliable component combinations for other low-resource languages that use non-Latin scripts.
- Hybrid systems that route queries to the generator strongest on the needed metric might improve end-to-end results without retraining.
- The observed variation across metrics suggests that automatic metric weighting or learned reranking could be a useful next step.
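The routing idea in the list above can be sketched as a simple lookup. This is only an illustration: the scores are the per-metric best values reported in the abstract, while the metric keys and the `route_generator` function are hypothetical names, not part of the paper.

```python
from typing import Dict, Tuple

# Best-scoring generator per metric, as reported in the abstract
# (all generators paired with BGE-M3 as the retriever).
BEST_BY_METRIC: Dict[str, Tuple[str, float]] = {
    "faithfulness": ("Qwen3.5-9B", 0.859),
    "context_relevance": ("Qwen3.5-9B", 0.726),
    "factual_correctness": ("Qwen3-8B", 0.380),
    "answer_relevance": ("SeaLLMs-v3-7B-Chat", 0.867),
    "answer_similarity": ("SeaLLMs-v3-7B-Chat", 0.836),
    "answer_correctness": ("SeaLLMs-v3-7B-Chat", 0.599),
}


def route_generator(priority_metric: str) -> str:
    """Pick the generator that scored highest on the metric a use case cares about.

    Hypothetical helper: a real router would also need a way to infer the
    priority metric from the query, which this sketch does not attempt.
    """
    try:
        model, _score = BEST_BY_METRIC[priority_metric]
    except KeyError:
        raise ValueError(f"unknown metric: {priority_metric!r}") from None
    return model


if __name__ == "__main__":
    for metric in ("faithfulness", "factual_correctness", "answer_relevance"):
        print(metric, "->", route_generator(metric))
```

A static table like this only reflects rankings on the 200-pair golden set, so it inherits whatever selection bias that set carries.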
Load-bearing premise
The curated golden dataset of 200 Khmer question-answer pairs is representative and free of selection bias for evaluating real-world performance in the telecom domain.
What would settle it
Repeating the full evaluation on an independently collected set of several hundred new Khmer questions drawn from actual telecom support logs or forums would show whether the same model rankings hold.
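One way to check whether rankings hold across two evaluation sets is a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, which ignores ties) over per-model scores. The model names and scores are placeholders invented for illustration and are not results from the paper; `kendall_tau` is a hypothetical helper.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables over the same set of models.

    +1.0 means identical rankings, -1.0 means fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both evaluations must cover the same models")
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Placeholder scores only: NOT values from the paper.
    golden_set = {"model_a": 0.8, "model_b": 0.6, "model_c": 0.4}
    new_set = {"model_a": 0.7, "model_b": 0.3, "model_c": 0.5}
    print(round(kendall_tau(golden_set, new_set), 3))
```

A tau near 1 on an independently collected question set would support the reported rankings; a low or negative value would suggest they depend on the golden set.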
Original abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a two-phase empirical evaluation of RAG for Khmer telecom-domain QA. It benchmarks three dense retrievers (BGE-M3, Jina-Embeddings-v3, Qwen3-Embedding) over Khmer documents and finds BGE-M3 strongest (Hit Rate@3 = 0.285). Using BGE-M3 as retriever, it then compares five generators (Qwen3-8B, Qwen3.5-9B, Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, Llama-SEA-LION-v2-8B-IT) on a curated set of 200 Khmer QA pairs using six RAGAS-inspired metrics, reporting that no single generator dominates.
Significance. If the empirical results hold, the work supplies one of the first systematic comparisons of modern embedding and generation models for RAG in Khmer, a low-resource non-Latin-script language. The observation that retrieval performance remains the primary bottleneck while generator strengths are metric-dependent is actionable for practitioners working on similar Southeast-Asian languages. The use of direct held-out evaluation with standard metrics is a methodological strength.
major comments (1)
- [Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.
minor comments (2)
- [Abstract] The abstract states that six 'RAGAS-inspired' metrics are used but does not specify which exact RAGAS implementations or any Khmer-specific adaptations (e.g., tokenization or embedding choices for faithfulness scoring) were employed.
- Retrieval results would be easier to interpret if the total number of documents in the corpus and the average document length were reported alongside the Hit Rate and MRR figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding dataset documentation below and will incorporate the necessary revisions.
Referee: [Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.
Authors: We agree that the manuscript would benefit from expanded documentation of the 200 QA pairs to better support the comparative claims and address potential concerns about selection bias. In the revised version, we will add a dedicated subsection detailing the dataset construction process, including the sourcing of questions from authentic Khmer telecom-domain sources, the generation protocol, expert validation procedures, topic coverage across key telecom areas, linguistic diversity considerations, and the distribution of question difficulty levels. This will improve transparency and help readers assess the generalizability of the findings to other low-resource Southeast Asian languages. revision: yes
Circularity Check
No circularity in empirical benchmarks
The paper reports direct empirical measurements of retrieval metrics (Hit Rate@3, MRR@3, etc.) for three embedding models and six RAGAS-inspired metrics for five generators on a fixed set of 200 curated Khmer QA pairs. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. Central claims rest on straightforward benchmark results rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard RAGAS-style metrics faithfully capture faithfulness, relevance, and correctness for Khmer text.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We benchmark three embedding models... BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Retrieval-augmented generation for knowledge-intensive NLP tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, Red Hook, N...
-
[2]
Retrieval-augmented generation for large language models: A survey, 2024
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024
-
[3]
Ragas: Automated evaluation of retrieval augmented generation
Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julian’s, Malta, 2024. Association for Computational Linguistics
-
[4]
Evaluation of RAG metrics for question answering in the telecom domain
Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, and Sai Krishna Bala. Evaluation of RAG metrics for question answering in the telecom domain. In ICML 2024 Workshop on Foundation Models in the Wild, pages 1–7, Vienna, Austria,
- [5]
-
[6]
Survey of hallucination in natural language generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023
-
[7]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, 2020. Association for Computational Linguistics
-
[8]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024
-
[9]
Sara Bourbour Hosseinbeigi, Sina Asghari, Mohammad Ali Seif Kashani, Mohammad Hossein Shalchian, and Mohammad Amin Abbasi. Advancing retrieval-augmented generation for persian: Development of language models, comprehensive benchmarks, and best practices for optimization, 2025
-
[10]
BLEU: A method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311–318, Philadelphia, PA, USA, 2002. Association for Computational Linguistics
-
[11]
ROUGE: A package for automatic evaluation of summaries
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics.
-
[12]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2019
-
[13]
G-eval: Nlg evaluation using gpt-4 with better human alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics
-
[14]
Ragas: Retrieval augmented generation assessment. https://github.com/vibrantlabsai/ragas, 2026
Exploding Gradients. Ragas: Retrieval augmented generation assessment. https://github.com/vibrantlabsai/ragas, 2026. GitHub repository, accessed 2026-03-12
-
[15]
Benchmarking large language models in retrieval-augmented generation
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, Vancouver, Canada, 2024. AAAI Press
-
[16]
Sea-lion: Southeast asian languages in one network, 2025
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montala...
-
[17]
Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, T...
-
[18]
Fine-tuning for question answering in low-resource languages: A case study on khmer
Kimleang Ly, Dona Valy, and Phutphalla Kong. Fine-tuning for question answering in low-resource languages: A case study on khmer. In 2024 17th International Congress on Advanced Applied Informatics (IIAI-AAI-Winter), pages 162–165, Kitakyushu, Japan,
-
[19]
Ollama: Run large language models locally. https://ollama.com, 2024
Ollama. Ollama: Run large language models locally. https://ollama.com, 2024
-
[20]
jina-embeddings-v3: Multilingual embeddings with task lora, 2024
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. jina-embeddings-v3: Multilingual embeddings with task lora, 2024
-
[21]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025
-
[22]
Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
-
[23]
Qwen3.5: Accelerating productivity with native multimodal agents, February
Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February
-
[24]
URL https://qwen.ai/blog?id=qwen3.5.
-
[25]
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. Seallms 3: Open foundation and chat multilingual large language models for southeast asian languages, 2024
-
[26]
GPT-4o mini. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024
OpenAI. GPT-4o mini. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024
-
[27]
Given a question and answer, create one or more statements from each sentence in the given answer
Harald Steck, Chaitanya Ekanadham, and Nathan Kallus. Is cosine-similarity of embeddings really about similarity? In Companion Proceedings of the ACM on Web Conference 2024, pages 887–890, New York, NY, USA, 2024. Association for Computing Machinery.
A Computation of RAGAS Metrics
We refer the reader to Es et al. [3] and Roychowdhury et al. [4] for details on the...
-
[28]
The code used to unsubscribe from mobile supplementary services (VAS) is code *1200#
-
[29]
Code *1200# can be used without requiring balance top-up actions or identity verification
-
[30]
To unsubscribe from supplementary services, the user dials *1200# and presses send
-
[31]
Unsubscribing from supplementary services does not require checking balance or contacting customer service.
Verification verdicts: Statement 1: Yes; Statement 2: Yes; Statement 3: Yes; Statement 4: Yes.
Computation: supported statements = 4, total statements = 4, therefore FaiFul = 4/4 = 1.000.
Answer relevance
Three reverse questions were generated from the an...
-
[32]
Code *1200# can be used to unsubscribe from unwanted mobile supplementary services.
False positives (FP):
-
[33]
The code to unsubscribe from mobile supplementary services (VAS) without needing a balance check, subscriber identity verification, top-up, or customer-service contact is code *1200#
-
[34]
By dialing *1200# and pressing send, users can unsubscribe from supplementary services without balance top-up or identity-verification actions, as announced in the public notice.
False negatives (FN): None.
Computation: TP = 1, FP = 2, FN = 0; precision = 1/(1 + 2) = 1/3, recall = 1/(1 + 0) = 1, and F1 = 2 × (1/3) × 1 / (1/3 + 1) = 0.500.
Answer similarity / correctne...
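The two worked computations above (faithfulness as the share of supported statements, and factual correctness as a claim-level F1) can be reproduced with a short sketch. The function names are hypothetical, and the zero-denominator handling is an assumption rather than something the excerpt specifies.

```python
def faithfulness(supported: int, total: int) -> float:
    """Share of answer statements that the retrieved context supports."""
    if total == 0:
        return 0.0  # assumption: an answer with no statements scores 0
    return supported / total


def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: TP and FP come from the answer, FN from the reference."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Counts taken from the worked example in the text.
    print(f"faithfulness = {faithfulness(4, 4):.3f}")
    print(f"factual correctness (F1) = {claim_f1(tp=1, fp=2, fn=0):.3f}")
```

With the example's counts, precision is 1/3 and recall is 1, so the harmonic mean is pulled toward the lower value, which is why two extra unsupported claims cut the score to 0.500 despite perfect recall.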