A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

Kimleang Ly; Phannet Pov; Ratanaktepi Chhor; Saksonita Khoeurn; Sereiwathna Ros; Wan-Sup Cho

arxiv: 2605.22099 · v1 · pith:HBGJBHJBnew · submitted 2026-05-21 · 💻 cs.CL

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn

Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3

keywords: Khmer, Retrieval-Augmented Generation, RAG, dense retrieval, question answering, language models, low-resource languages, telecom domain

The pith

BGE-M3 retrieves Khmer documents most effectively in RAG while no generator model leads on every quality metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates retrieval-augmented generation for question answering over Khmer-language telecom documents. It first compares three embedding models on dense retrieval and identifies BGE-M3 as the strongest performer across hit rate, file hit rate, MRR, and precision at top-3. It then fixes BGE-M3 as retriever and tests five generator models on a set of 200 curated question-answer pairs using six RAGAS-style metrics. Different generators prove strongest on different axes, with one leading faithfulness and context relevance, another factual correctness, and a third answer relevance, similarity, and correctness. A reader would care because the results identify concrete component choices that affect how reliably a RAG system can ground answers in retrieved evidence for a low-resource, non-Latin-script language.

Core claim

BGE-M3 achieves the highest retrieval scores with Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112 over Khmer documents. When this retriever is paired with each of five generators, Qwen3.5-9B scores highest on faithfulness and context relevance, Qwen3-8B on factual correctness, and SeaLLMs-v3-7B-Chat on answer relevance, answer similarity, and answer correctness.
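
The retrieval scores above can be made concrete with a small sketch. It assumes the common textbook definitions of Hit Rate@k, MRR@k, and Precision@k; the paper's exact definitions are not given in this summary, and the function names and toy chunk ids below are hypothetical. File Hit Rate@k is assumed here to be the same hit-rate computation applied to source-file ids instead of chunk ids.

```python
from typing import Hashable, Sequence, Set

Ranked = Sequence[Sequence[Hashable]]
Gold = Sequence[Set[Hashable]]


def hit_rate_at_k(ranked: Ranked, relevant: Gold, k: int = 3) -> float:
    """Share of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for got, gold in zip(ranked, relevant)
        if any(item in gold for item in got[:k])
    )
    return hits / len(ranked)


def mrr_at_k(ranked: Ranked, relevant: Gold, k: int = 3) -> float:
    """Mean of 1/rank of the first relevant item in the top k (0 if none)."""
    total = 0.0
    for got, gold in zip(ranked, relevant):
        for rank, item in enumerate(got[:k], start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked)


def precision_at_k(ranked: Ranked, relevant: Gold, k: int = 3) -> float:
    """Mean fraction of the top-k slots filled by relevant items."""
    total = sum(
        sum(1 for item in got[:k] if item in gold) / k
        for got, gold in zip(ranked, relevant)
    )
    return total / len(ranked)


# Toy data with hypothetical chunk ids (not taken from the paper).
toy_ranked = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
toy_relevant = [{"c3"}, {"c9", "c4"}, {"c0"}]

print(hit_rate_at_k(toy_ranked, toy_relevant))
print(mrr_at_k(toy_ranked, toy_relevant))
print(precision_at_k(toy_ranked, toy_relevant))
```

In the toy data two of the three queries have a relevant chunk in the top 3, so the hit rate is 2/3, while the third query contributes zero to every metric. Passing file ids instead of chunk ids would give the file-level variant under the assumption stated above.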

What carries the argument

Two-phase benchmarking that first selects the best dense retriever among three embedding models and then measures five generator models against six RAGAS-inspired metrics on a fixed retriever.

If this is right

Retriever choice remains the larger performance bottleneck for Khmer RAG systems.
Generator selection should be driven by which metric matters most for a given use case.
Multilingual embedding models trained across many languages transfer better to Khmer retrieval than the alternatives tested.
Practical deployments will need to trade off among faithfulness, factual correctness, and semantic similarity rather than expect one model to optimize all three.
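
As a rough illustration of metric-driven generator selection, the sketch below stores only the per-metric leaders and scores reported in the abstract and looks up the leader for a chosen priority. The dictionary keys and the `pick_generator` helper are hypothetical names introduced for this example; scores for non-leading models are not reported here, so no weighted trade-off is attempted.

```python
# Per-metric leaders and scores as reported in the paper's abstract.
# Keys and function name are illustrative, not part of the paper.
REPORTED_LEADERS = {
    "faithfulness": ("Qwen3.5-9B", 0.859),
    "context_relevance": ("Qwen3.5-9B", 0.726),
    "factual_correctness": ("Qwen3-8B", 0.380),
    "answer_relevance": ("SeaLLMs-v3-7B-Chat", 0.867),
    "answer_similarity": ("SeaLLMs-v3-7B-Chat", 0.836),
    "answer_correctness": ("SeaLLMs-v3-7B-Chat", 0.599),
}


def pick_generator(priority: str) -> str:
    """Return the reported leader for the metric that matters most."""
    if priority not in REPORTED_LEADERS:
        raise ValueError(f"unknown metric: {priority!r}")
    model, _score = REPORTED_LEADERS[priority]
    return model


print(pick_generator("faithfulness"))
print(pick_generator("factual_correctness"))
```

A lookup like this only reflects rankings on the paper's 200-pair set; whether those rankings hold on other data is exactly the open question raised below.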

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar side-by-side tests could identify reliable component combinations for other low-resource languages that use non-Latin scripts.
Hybrid systems that route queries to the generator strongest on the needed metric might improve end-to-end results without retraining.
The observed variation across metrics suggests that automatic metric weighting or learned reranking could be a useful next step.

Load-bearing premise

The curated golden dataset of 200 Khmer question-answer pairs is representative and free of selection bias for evaluating real-world performance in the telecom domain.

What would settle it

Repeating the full evaluation on an independently collected set of several hundred new Khmer questions drawn from actual telecom support logs or forums would show whether the same model rankings hold.

Figures

Figures reproduced from arXiv: 2605.22099 by Kimleang Ly, Phannet Pov, Ratanaktepi Chhor, Saksonita Khoeurn, Sereiwathna Ros, Wan-Sup Cho.

Original abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BGE-M3 leads on retrieval for this Khmer telecom set but no generator dominates the metrics, though the 200-pair dataset lacks needed details on construction.

This paper's core finding is that BGE-M3 outperforms the other two embedding models on retrieval metrics for Khmer documents, hitting 0.285 Hit Rate@3, while the generators show no single best performer across the six metrics. It does a useful job of running these comparisons in a low-resource setting that rarely gets this kind of attention. The two-phase setup, first selecting the retriever then testing generators on the 200-pair set, is practical and the reported numbers let you see the trade-offs clearly. Credit to them for focusing on a specific domain like telecom and using real metrics instead of just claiming improvements. The soft spot is the golden dataset. Without details on how the 200 Khmer QA pairs were created or validated, it's hard to rule out that the results reflect the particular questions chosen rather than general model behavior. The stress-test note is right on this. They also skip any mention of statistical tests, which would help on a set this size. This is for engineers or researchers trying to deploy RAG in Khmer or similar languages for practical use cases. Someone needing baseline numbers for model choice in non-English RAG would find it relevant. It has enough substance in the empirical results to go to a serious referee, who could push on the dataset and add some controls. I'd recommend peer review to strengthen the evaluation details.
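
To give a sense of why statistical tests matter at this sample size, here is a minimal sketch of a 95% Wilson score interval for the reported Hit Rate@3 of 0.285. It assumes the retrieval score was measured on 200 queries (the size of the golden set) and that each query is an independent hit-or-miss trial; neither assumption is confirmed by the material here, and the paper reports no such interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_queries = 200                      # assumption: one retrieval query per QA pair
hits = round(0.285 * n_queries)      # reported Hit Rate@3 converted to a count
low, high = wilson_interval(hits, n_queries)
print(f"Hit Rate@3 = {hits / n_queries:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 0.23 to 0.35, which is wide enough that small gaps between retrievers on the same 200 items could fall within noise.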

Referee Report

1 major / 2 minor

Summary. The paper presents a two-phase empirical evaluation of RAG for Khmer telecom-domain QA. It benchmarks three dense retrievers (BGE-M3, Jina-Embeddings-v3, Qwen3-Embedding) over Khmer documents and finds BGE-M3 strongest (Hit Rate@3 = 0.285). Using BGE-M3 as retriever, it then compares five generators (Qwen3-8B, Qwen3.5-9B, Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, Llama-SEA-LION-v2-8B-IT) on a curated set of 200 Khmer QA pairs using six RAGAS-inspired metrics, reporting that no single generator dominates.

Significance. If the empirical results hold, the work supplies one of the first systematic comparisons of modern embedding and generation models for RAG in Khmer, a low-resource non-Latin-script language. The observation that retrieval performance remains the primary bottleneck while generator strengths are metric-dependent is actionable for practitioners working on similar Southeast-Asian languages. The use of direct held-out evaluation with standard metrics is a methodological strength.

major comments (1)

[Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.

minor comments (2)

[Abstract] The abstract states that six 'RAGAS-inspired' metrics are used but does not specify which exact RAGAS implementations or any Khmer-specific adaptations (e.g., tokenization or embedding choices for faithfulness scoring) were employed.
Retrieval results would be easier to interpret if the total number of documents in the corpus and the average document length were reported alongside the Hit Rate and MRR figures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding dataset documentation below and will incorporate the necessary revisions.

Referee: [Abstract] Abstract and dataset-construction section: all reported retrieval and generation metrics are computed exclusively on one fixed set of 200 curated Khmer QA pairs. The manuscript provides no information on sourcing of the questions, generation protocol, expert validation steps, topic coverage, linguistic diversity, or difficulty distribution. Because the central comparative claims (BGE-M3 superiority and absence of a dominant generator) rest entirely on performance differences observed on this set, the lack of documentation leaves the results vulnerable to selection bias and limits their generalizability.

Authors: We agree that the manuscript would benefit from expanded documentation of the 200 QA pairs to better support the comparative claims and address potential concerns about selection bias. In the revised version, we will add a dedicated subsection detailing the dataset construction process, including the sourcing of questions from authentic Khmer telecom-domain sources, the generation protocol, expert validation procedures, topic coverage across key telecom areas, linguistic diversity considerations, and the distribution of question difficulty levels. This will improve transparency and help readers assess the generalizability of the findings to other low-resource Southeast Asian languages.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmarks

The paper reports direct empirical measurements of retrieval metrics (Hit Rate@3, MRR@3, etc.) for three embedding models and six RAGAS-inspired metrics for five generators on a fixed set of 200 curated Khmer QA pairs. No equations, first-principles derivations, or predictions are presented that reduce to fitted parameters or self-referential definitions by construction. Central claims rest on straightforward benchmark results rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on pre-trained models and standard RAG evaluation practices from prior work; no new free parameters, axioms, or invented entities are introduced.

axioms (1)

Domain assumption: Standard RAGAS-style metrics faithfully capture faithfulness, relevance, and correctness for Khmer text.
Invoked when reporting the six metrics as quantifiers of system performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We benchmark three embedding models... BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, Red Hook, N...
[2] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.
[3] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, St. Julian's, Malta, 2024. Association for Computational Linguistics.
[4] Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, and Sai Krishna Bala. Evaluation of RAG metrics for question answering in the telecom domain. In ICML 2024 Workshop on Foundation Models in the Wild, pages 1–7, Vienna, Austria,
[5] PMLR. arXiv:2407.12873
[6] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
[7] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, 2020. Association for Computational Linguistics.
[8] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[9] Sara Bourbour Hosseinbeigi, Sina Asghari, Mohammad Ali Seif Kashani, Mohammad Hossein Shalchian, and Mohammad Amin Abbasi. Advancing retrieval-augmented generation for persian: Development of language models, comprehensive benchmarks, and best practices for optimization, 2025.
[10] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311–318, Philadelphia, PA, USA, 2002. Association for Computational Linguistics.
[11] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics.
[12] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2019.
[13] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics.
[14] Exploding Gradients. Ragas: Retrieval augmented generation assessment. https://github.com/vibrantlabsai/ragas, 2026. GitHub repository, accessed 2026-03-12.
[15] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, Vancouver, Canada, 2024. AAAI Press.
[16] Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, Brandon Ong, Zhi Hao Ong, Jann Railey Montala... Sea-lion: Southeast asian languages in one network, 2025.
[17] Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, T... Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin
[18] Kimleang Ly, Dona Valy, and Phutphalla Kong. Fine-tuning for question answering in low-resource languages: A case study on khmer. In 2024 17th International Congress on Advanced Applied Informatics (IIAI-AAI-Winter), pages 162–165, Kitakyushu, Japan,
[19] Ollama. Ollama: Run large language models locally. https://ollama.com, 2024.
[20] Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. jina-embeddings-v3: Multilingual embeddings with task lora, 2024.
[21] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.
[22] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
[23] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February
[24] URL https://qwen.ai/blog?id=qwen3.5.
[25] Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. Seallms 3: Open foundation and chat multilingual large language models for southeast asian languages, 2024.
[26] OpenAI. GPT-4o mini. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence, 2024.
[27] Harald Steck, Chaitanya Ekanadham, and Nathan Kallus. Is cosine-similarity of embeddings really about similarity? In Companion Proceedings of the ACM on Web Conference 2024, pages 887–890, New York, NY, USA, 2024. Association for Computing Machinery.

A Computation of RAGAS Metrics

We refer the reader to Es et al. [3] and Roychowdhury et al. [4] for details on the...

Given a question and answer, create one or more statements from each sentence in the given answer

[28] The code used to unsubscribe from mobile supplementary services (VAS) is code *1200#
[29] Code *1200# can be used without requiring balance top-up actions or identity verification
[30] To unsubscribe from supplementary services, the user dials *1200# and presses send
[31] Unsubscribing from supplementary services does not require checking balance or contacting customer service.

Verification verdicts: Statement 1: Yes; Statement 2: Yes; Statement 3: Yes; Statement 4: Yes.

Computation: supported statements = 4, total statements = 4, therefore FaiFul = 4/4 = 1.000.

Answer relevance

Three reverse questions were generated from the an...

[32] Code *1200# can be used to unsubscribe from unwanted mobile supplementary services.

False positives (FP):

[33] The code to unsubscribe from mobile supplementary services (VAS) without needing a balance check, subscriber identity verification, top-up, or customer-service contact is code *1200#
[34] By dialing *1200# and pressing send, users can unsubscribe from supplementary services without balance top-up or identity-verification actions, as announced in the public notice.

False negatives (FN): None.

Computation: TP = 1, FP = 2, FN = 0; precision = 1/(1 + 2) = 1/3, recall = 1/(1 + 0) = 1, and F1 = 2 × (1/3) × 1 / (1/3 + 1) = 0.500.

Answer similarity / correctne...
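
The two worked computations quoted above (faithfulness as supported statements over total statements, and F1 from TP, FP, and FN counts) can be checked with a short sketch. The function names are hypothetical, and the formulas are the standard ratio and harmonic-mean definitions that the quoted arithmetic appears to follow; the paper's own scoring code is not shown here.

```python
def faithfulness(verdicts: list[bool]) -> float:
    """Supported statements divided by total statements."""
    return sum(verdicts) / len(verdicts)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from claim-level counts (0.0 on empty denominators)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Counts taken from the quoted example: four "Yes" verdicts; TP=1, FP=2, FN=0.
faith_score = faithfulness([True, True, True, True])
precision, recall, f1 = precision_recall_f1(tp=1, fp=2, fn=0)

print(f"faithfulness = {faith_score:.3f}")
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

With these counts the sketch reproduces the quoted values of 1.000 for faithfulness and 0.500 for F1, which shows how two extra unsupported claims can halve the factual-correctness score even when recall is perfect.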
