Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
Pith reviewed 2026-05-22 06:12 UTC · model grok-4.3
The pith
Character-based recursive chunking at 300 characters outperforms other splits when embedding Khmer agricultural documents for retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a RAG pipeline applied to Khmer agricultural documents, four chunking methods (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) are compared after encoding chunks with the BGE-M3 model and retrieving via FAISS. The character-based Recursive method with a 300-character chunk size records the lowest average L2 distance (0.4295), the highest answer relevance (0.8663), and the highest Khmer IoU (0.6441). A paired t-test indicates a statistically significant L2 improvement over Sentence-Based chunking (p = 0.0121).
What carries the argument
The character-based Recursive chunking method: it repeatedly divides text while respecting a fixed character limit and preserving structure, and the resulting chunks are what get embedded and searched by similarity.
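A minimal sketch of recursive character chunking may make the mechanism concrete. It is not the paper's implementation: the separator hierarchy (paragraph break, line break, the Khmer sentence mark U+17D4, space, then a hard cut) and the function name `recursive_chunks` are assumptions chosen for illustration, and "character" here means a Python code point, which may differ from how the paper counts characters.

```python
# Hypothetical sketch of recursive character chunking (not the paper's code).
# Assumed separator order: paragraph, line, Khmer sentence mark (U+17D4), space,
# then "" meaning "cut at the character limit".
DEFAULT_SEPARATORS = ("\n\n", "\n", "\u17d4", " ", "")


def recursive_chunks(text, chunk_size=300, separators=DEFAULT_SEPARATORS):
    """Split text into chunks of at most chunk_size code points.

    Coarse separators are tried first; a piece that is still too long is
    split again with the next, finer separator. Joining the chunks gives
    back the original text.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No structural separator left: cut at fixed offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator absent from this text: try the next one.
        return recursive_chunks(text, chunk_size, finer)

    chunks, current = [], ""
    for i, piece in enumerate(pieces):
        if i < len(pieces) - 1:
            piece += sep  # keep the separator attached to the text before it
        if len(piece) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
        elif len(current) + len(piece) <= chunk_size:
            current += piece  # merge small neighbours up to the limit
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

One consequence of this design is that a hard cut can still land inside a Khmer consonant cluster when no coarser separator applies. Whether the paper's splitter guards against that is not stated in the material above.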
If this is right
- Segmentation granularity and structural preservation become central design choices for dense retrieval in morphologically complex low-resource languages.
- Recursive chunking can be expected to improve answer quality over sentence-based or LLM-based alternatives in Khmer RAG applications.
- The observed statistical edge in L2 distance supplies a concrete basis for preferring the 300-character recursive approach in similar agricultural document settings.
- Five-fold cross-validation on the 18 pairs offers a replicable protocol for testing future chunking variants on the same data.
Where Pith is reading between the lines
- The same recursive size and method could be tried on other Southeast Asian low-resource languages that share script or morphological traits with Khmer.
- Production systems would still need separate checks against user query logs and larger test sets before deployment.
- Pairing the winning chunking strategy with query expansion or reranking steps might yield further gains that the current comparison leaves untested.
Load-bearing premise
Results from only 18 question-answer pairs and the chosen set of retrieval and overlap metrics are sufficient to decide which chunking method will work best in real Khmer agricultural document systems.
What would settle it
Repeating the evaluation on a fresh collection of several hundred question-answer pairs drawn from the same or similar documents, or measuring actual user success rates when the system answers new queries, would show whether the 300-character recursive method keeps its lead.
Original abstract
In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
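To illustrate what an "Average Retrieval Score (L2 distance)" could measure, here is a small brute-force sketch in plain Python. It stands in for the FAISS search rather than calling FAISS. The function names, the toy 2-dimensional vectors, and the choice to average over the top-k distances per query are all assumptions, since the abstract does not say how the score is aggregated. FAISS's flat L2 index reports squared distances, so the scale of the paper's numbers also depends on whether a square root was taken, which the abstract does not state.

```python
import math


def search_l2(query, chunk_vectors, k=3):
    """Return the k nearest chunks as (distance, chunk_index), closest first."""
    scored = [(math.dist(query, vec), i) for i, vec in enumerate(chunk_vectors)]
    return sorted(scored)[:k]


def average_retrieval_score(queries, chunk_vectors, k=3):
    """Mean L2 distance over the top-k hits of every query (assumed aggregation)."""
    distances = [d for q in queries for d, _ in search_l2(q, chunk_vectors, k)]
    return sum(distances) / len(distances)


# Toy vectors standing in for BGE-M3 embeddings (illustrative values only).
toy_chunks = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
toy_queries = [[0.0, 0.0]]
toy_score = average_retrieval_score(toy_queries, toy_chunks, k=2)
```

Lower is better under this reading: a smaller average distance means the retrieved chunks sit closer to the query in embedding space.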
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) for RAG on Khmer agricultural documents. Chunks are embedded with BGE-M3 and retrieved via FAISS; performance is measured via 5-fold cross-validation on 18 ground-truth QA pairs using L2 distance, Answer Relevance, Khmer Coverage, and Khmer IoU. The central empirical finding is that character-based Recursive chunking at size 300 achieves the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347), with a paired t-test indicating significance versus Sentence-Based chunking on L2 (p=0.0121).
Significance. If the comparative ranking holds, the work supplies concrete, language-specific guidance on segmentation granularity for dense retrieval in morphologically complex low-resource languages within agricultural RAG pipelines. The use of Khmer IoU and Coverage metrics, together with reported variability estimates (the ± values) and 5-fold CV, represents a modest but useful step toward reproducible evaluation in this setting. The absence of a non-RAG baseline and the narrow evaluation scope, however, constrain how far the optimality claim can be generalized.
major comments (2)
- [Evaluation] Evaluation section: the optimality conclusion for Recursive chunking rests on 5-fold CV over only 18 QA pairs (roughly 3–4 test instances per fold). With such limited data the reported means and standard deviations (e.g., L2 0.4295 ± 0.0461) are sensitive to pair selection; the single paired t-test does not correct for multiple comparisons across four strategies and four metrics, weakening the statistical support for declaring one method best.
- [Experimental Setup] Experimental Setup and Results: no retrieval baseline that omits RAG (or uses a different embedding model) is reported. While the paper focuses on relative chunking performance, the lack of such a reference makes it difficult to judge whether the observed metric differences translate to practically meaningful gains in Khmer agricultural retrieval.
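The "roughly 3–4 test instances per fold" figure in the first major comment follows directly from dividing 18 items into 5 folds. The sketch below shows the arithmetic; the helper name `fold_sizes` is hypothetical, and the convention of giving the first few folds one extra item is an assumption (it matches the common k-fold convention, but the paper's exact split is not described here).

```python
def fold_sizes(n_items, n_folds):
    """Sizes of the test folds when n_items are split as evenly as possible."""
    base, extra = divmod(n_items, n_folds)
    # The first `extra` folds receive one additional item.
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(18, 5)  # [4, 4, 4, 3, 3]
```

With only three or four test pairs per fold, a single unusually easy or hard question can shift a fold's mean noticeably, which is the sensitivity the referee points to.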
minor comments (2)
- [Abstract] Abstract: the metric list includes both 'Khmer Coverage' and 'Khmer Intersection over Union'; the results paragraph reports only the latter. Clarify whether Coverage is a distinct metric or a reporting omission.
- [Abstract] Notation: the abstract refers to 'Average Retrieval Score (L2 distance)' while the results use plain 'L2 distance'. Adopt consistent terminology throughout.
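On the Coverage versus IoU question raised in the minor comments, one plausible reading is that the two share a numerator and differ only in the denominator. The sketch below is an assumption, not the paper's definition: it compares sets of Khmer character bigrams (code points in the Khmer Unicode block U+1780 to U+17FF), and the function names are hypothetical.

```python
def khmer_units(text, n=2):
    """Set of character n-grams built from the Khmer code points in text."""
    khmer = "".join(ch for ch in text if "\u1780" <= ch <= "\u17ff")
    return {khmer[i:i + n] for i in range(len(khmer) - n + 1)}


def khmer_coverage(answer, retrieved):
    """Share of the answer's units that also appear in the retrieved text."""
    a, r = khmer_units(answer), khmer_units(retrieved)
    return len(a & r) / len(a) if a else 0.0


def khmer_iou(answer, retrieved):
    """Intersection over union of the two unit sets."""
    a, r = khmer_units(answer), khmer_units(retrieved)
    union = a | r
    return len(a & r) / len(union) if union else 0.0
```

Under these assumed definitions Coverage is never lower than IoU, because IoU also penalizes retrieved text that does not appear in the answer. If the paper defines the metrics along similar lines, the two are distinct and both would be worth reporting.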
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
- Referee: [Evaluation] Evaluation section: the optimality conclusion for Recursive chunking rests on 5-fold CV over only 18 QA pairs (roughly 3–4 test instances per fold). With such limited data the reported means and standard deviations (e.g., L2 0.4295 ± 0.0461) are sensitive to pair selection; the single paired t-test does not correct for multiple comparisons across four strategies and four metrics, weakening the statistical support for declaring one method best.
  Authors: We acknowledge that 18 QA pairs constitute a modest evaluation set, a limitation inherent to the scarcity of annotated Khmer agricultural data. These pairs represent the complete ground-truth collection available for this domain-specific task. Five-fold cross-validation was used to obtain more stable estimates from the limited data. We agree that the single paired t-test should account for multiple comparisons. In the revised manuscript we now apply a Bonferroni correction across the four metrics and report both raw and adjusted p-values. We have also added an explicit limitations subsection discussing sensitivity to pair selection and modest statistical power.
  Revision: yes
- Referee: [Experimental Setup] Experimental Setup and Results: no retrieval baseline that omits RAG (or uses a different embedding model) is reported. While the paper focuses on relative chunking performance, the lack of such a reference makes it difficult to judge whether the observed metric differences translate to practically meaningful gains in Khmer agricultural retrieval.
  Authors: The manuscript’s stated goal is to compare chunking strategies inside a fixed RAG pipeline rather than to benchmark RAG against non-retrieval methods. To address the concern about practical significance, the revised version includes a non-RAG baseline in which the LLM generates answers directly from the full documents without retrieval. We report Answer Relevance for this baseline and show that the best RAG configuration (Recursive chunking at 300 characters) yields a statistically higher score. Retrieval-specific metrics (L2 distance, Khmer IoU) are noted as inapplicable to the non-RAG case. We have clarified in the text that all optimality claims are relative to other chunking approaches within RAG.
  Revision: yes
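The Bonferroni correction mentioned in the rebuttal is simple enough to work through with the one p-value the abstract reports. This is a sketch under stated assumptions: the helper name is hypothetical, m = 4 reflects the rebuttal's "across the four metrics", and m = 12 is an illustrative alternative (three pairwise comparisons against Recursive times four metrics) that the rebuttal does not claim to use.

```python
def bonferroni_adjust(p_value, n_comparisons):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_comparisons)


raw_p = 0.0121  # paired t-test, Recursive vs Sentence-Based, L2 distance

adjusted_four_metrics = bonferroni_adjust(raw_p, 4)      # 0.0484
adjusted_twelve_tests = bonferroni_adjust(raw_p, 12)     # 0.1452
```

Corrected across four metrics, the result stays just under the conventional 0.05 threshold. Corrected across a larger family of twelve tests it would not, so the choice of comparison family matters for how strongly the L2 finding can be stated.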
Circularity Check
No circularity: direct empirical measurement on held-out QA pairs
The paper conducts an empirical comparison of four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, LLM-Based) inside a RAG pipeline. Chunks are encoded with the fixed BGE-M3 model and retrieved via FAISS; performance is measured by L2 distance, Answer Relevance, Khmer Coverage, and Khmer IoU against 18 ground-truth QA pairs under 5-fold cross-validation. No equations, fitted parameters, self-referential predictions, or derivations appear; the reported best performer (Recursive, size 300) is simply the strategy that produced the observed metric values on the held-out folds. The single paired t-test is a post-hoc statistical check on the same empirical results and does not create a circular loop. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: BGE-M3 produces useful embeddings for Khmer agricultural text without additional fine-tuning.
- Domain assumption: The 18 ground-truth QA pairs are representative of typical agricultural queries in Khmer.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347)."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.