Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Pichdara Po; Saksonita Khoeurn; Sereiwathna Ros; Sovandara Chhoun; Wan-Sup Cho

arxiv: 2605.22203 · v1 · pith:U4I225ZVnew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-22 06:12 UTC · model grok-4.3

keywords: text chunking, retrieval augmented generation, low-resource languages, Khmer language, agricultural documents, text embedding, RAG evaluation

The pith

Character-based recursive chunking at 300 characters outperforms other splits when embedding Khmer agricultural documents for retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four ways to divide Khmer farming texts before they enter a retrieval-augmented generation system that uses a multilingual embedding model. It measures each approach by how closely the retrieved chunks match ground-truth answers, using distance between embeddings, answer relevance, and the amount of original Khmer text preserved. Recursive chunking with a 300-character limit records the strongest scores across these measures and shows a statistically detectable edge over sentence-based splitting. Readers who care about building reliable search tools for low-resource languages will see that the choice of split size and method directly affects how well the system surfaces accurate information from complex scripts and limited training data.

Core claim

In a RAG pipeline applied to Khmer agricultural documents, the four chunking methods—Recursive, Khmer-Aware, Sentence-Based, and LLM-Based—are compared after encoding chunks with the BGE-M3 model and retrieving via FAISS. The character-based Recursive method with a 300-character chunk size records the lowest average L2 distance (0.4295), the highest answer relevance (0.8663), and the highest Khmer IoU (0.6441), with a paired t-test confirming a significant L2 improvement over sentence-based chunking.
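The retrieval score in this claim is an L2 distance between a query embedding and the retrieved chunk embeddings, where lower means closer. A minimal sketch of that scoring step follows, using brute-force search in plain Python as a stand-in for a FAISS index. The function names `l2_distance` and `search` and the toy 3-dimensional vectors are hypothetical, not taken from the paper. Some libraries report squared L2 distances (FAISS's flat L2 index does); whether the paper's scores are squared is not stated in the material here.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_vec, chunk_vecs, k=3):
    """Return the k nearest chunks as (distance, index) pairs, closest first."""
    scored = [(l2_distance(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return sorted(scored)[:k]

# Toy vectors standing in for real chunk embeddings (hypothetical data).
chunk_vecs = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
query_vec = [1.0, 0.0, 0.0]

hits = search(query_vec, chunk_vecs, k=2)
# Average distance over the retrieved chunks; lower means a closer match.
avg_l2 = sum(d for d, _ in hits) / len(hits)
```

Averaging the distances of the top-k hits over all questions gives one number per chunking strategy, which is presumably how a figure such as 0.4295 would be obtained.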

What carries the argument

The character-based Recursive chunking method, which repeatedly divides text while respecting a fixed character limit and preserving structure, to produce chunks for embedding and similarity search.
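As an illustration of the idea, here is a minimal sketch of a recursive character splitter with a 300-character limit. It is not the authors' implementation: the function name `recursive_chunks`, the separator order, and the use of the Khmer full stop "។" as a separator are assumptions made for this example. Separators are dropped at chunk boundaries, and `len` counts Unicode code points, so the final hard-cut fallback could split a Khmer character cluster.

```python
def recursive_chunks(text, max_chars=300, separators=("\n\n", "\n", "។", " ")):
    """Split text into pieces of at most max_chars characters.

    Tries the coarsest separator first and recurses with finer ones only
    for pieces that are still too long; falls back to a hard cut.
    """
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut at the character limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return recursive_chunks(text, max_chars, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the open chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_chunks(part, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

# Synthetic sample text (999 characters), used only to exercise the splitter.
sample = " ".join(["word"] * 200)
chunks = recursive_chunks(sample, max_chars=300)
```

The greedy packing keeps each chunk as close to the limit as the separators allow, which is what lets paragraph and sentence structure survive where it fits.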

If this is right

Segmentation granularity and structural preservation become central design choices for dense retrieval in morphologically complex low-resource languages.
Recursive chunking can be expected to improve answer quality over sentence-based or LLM-based alternatives in Khmer RAG applications.
The observed statistical edge in L2 distance supplies a concrete basis for preferring the 300-character recursive approach in similar agricultural document settings.
Five-fold cross-validation on the 18 pairs offers a replicable protocol for testing future chunking variants on the same data.
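To make the fold sizes concrete, the sketch below splits 18 items into five contiguous folds. The helper `k_fold_indices` is hypothetical, and the paper's actual fold assignment (shuffled or not) and its use of the non-test portion of each fold are not described in the material here.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional item.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

fold_sizes = [len(test) for _, test in k_fold_indices(18, k=5)]
```

With 18 pairs, every fold scores only three or four questions, so a single unusual question can move a fold's mean noticeably.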

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recursive size and method could be tried on other Southeast Asian low-resource languages that share script or morphological traits with Khmer.
Production systems would still need separate checks against user query logs and larger test sets before deployment.
Pairing the winning chunking strategy with query expansion or reranking steps might yield further gains that the current comparison leaves untested.

Load-bearing premise

Results from only 18 question-answer pairs and the chosen set of retrieval and overlap metrics are sufficient to decide which chunking method will work best in real Khmer agricultural document systems.

What would settle it

Repeating the evaluation on a fresh collection of several hundred question-answer pairs drawn from the same or similar documents, or measuring actual user success rates when the system answers new queries, would show whether the 300-character recursive method keeps its lead.

Figures

Figures reproduced from arXiv: 2605.22203 by Pichdara Po, Saksonita Khoeurn, Sereiwathna Ros, Sovandara Chhoun, Wan-Sup Cho.

**Figure 1.** End-to-end pipeline chunking evaluation.

Original abstract

In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
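The abstract names Khmer Coverage and Khmer Intersection over Union without defining them here. One plausible reading, offered purely as an assumption, is a character-level overlap restricted to the Khmer Unicode block (U+1780 to U+17FF). The sketch below implements that reading; the function names and the choice of multiset overlap are hypothetical and may differ from the paper's definitions.

```python
from collections import Counter

def khmer_chars(text):
    """Count only code points in the Khmer Unicode block (U+1780 to U+17FF)."""
    return Counter(ch for ch in text if "\u1780" <= ch <= "\u17ff")

def khmer_iou(retrieved, reference):
    """Multiset intersection over union of Khmer characters (assumed definition)."""
    a, b = khmer_chars(retrieved), khmer_chars(reference)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def khmer_coverage(retrieved, reference):
    """Share of the reference's Khmer characters found in the retrieved text."""
    a, b = khmer_chars(retrieved), khmer_chars(reference)
    total = sum(b.values())
    return sum((a & b).values()) / total if total else 0.0

# Synthetic strings built from Khmer code points; not real sentences.
reference = "\u1780\u1781\u1782\u1783"
retrieved = "\u1780\u1781 abc \u1784"
iou = khmer_iou(retrieved, reference)
coverage = khmer_coverage(retrieved, reference)
```

Under this reading, Coverage ignores extra retrieved text while IoU penalizes it, which would explain why the two are reported as separate metrics.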

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Recursive chunking at 300 chars comes out ahead on Khmer ag docs, but the 18 QA pairs make any optimality claim shaky.

The key takeaway is that recursive character-based chunking at 300 characters outperformed the other three strategies on the authors' Khmer agricultural documents. It showed the lowest L2 distance, the highest answer relevance, and the best Khmer IoU in a RAG pipeline using BGE-M3 embeddings and FAISS retrieval.

This work is new in that it provides specific performance numbers for chunking methods applied to Khmer text in the agriculture domain. The literature has general studies on chunking for RAG, but nothing directly comparable for this low-resource language and narrow topic. The authors evaluated four approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) against 18 ground-truth QA pairs using 5-fold cross-validation.

The paper does a decent job with the evaluation. It reports means with ± variability and runs a paired t-test for one comparison, which adds some rigor. The metrics include both standard retrieval scores and language-specific ones such as Khmer Coverage and IoU, which makes sense for checking how well the chunks preserve the content.

Where it gets soft is the scale of the experiment. With only 18 QA pairs in total, each fold has just a handful of test cases. The reported ± values might not capture the full variability, and the single t-test doesn't account for testing multiple strategies and metrics. The ranking could shift with a different or larger set of questions. The authors also skip any comparison to a simple keyword search or a no-retrieval setup, so we don't know the practical impact.

Readers who are implementing RAG for similar low-resource languages or agricultural knowledge bases will find this relevant. It offers concrete guidance on chunk sizes and methods that worked here. For someone looking for theoretical advances in embeddings or retrieval, there's less to take away.

I think this paper should go to peer review. The empirical comparison is solid enough for the niche, and referees can help strengthen the statistical claims and suggest ways to expand the evaluation. It's not groundbreaking, but it's a useful data point for practitioners.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) for RAG on Khmer agricultural documents. Chunks are embedded with BGE-M3 and retrieved via FAISS; performance is measured via 5-fold cross-validation on 18 ground-truth QA pairs using L2 distance, Answer Relevance, Khmer Coverage, and Khmer IoU. The central empirical finding is that character-based Recursive chunking at size 300 achieves the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347), with a paired t-test indicating significance versus Sentence-Based chunking on L2 (p=0.0121).

Significance. If the comparative ranking holds, the work supplies concrete, language-specific guidance on segmentation granularity for dense retrieval in morphologically complex low-resource languages within agricultural RAG pipelines. The use of Khmer IoU and Coverage metrics, together with reported ± variability and 5-fold CV, represents a modest but useful step toward reproducible evaluation in this setting. The absence of a non-RAG baseline and the narrow evaluation scope, however, constrain how far the optimality claim can be generalized.

major comments (2)

[Evaluation] Evaluation section: the optimality conclusion for Recursive chunking rests on 5-fold CV over only 18 QA pairs (roughly 3–4 test instances per fold). With such limited data the reported means and standard deviations (e.g., L2 0.4295 ± 0.0461) are sensitive to pair selection; the single paired t-test does not correct for multiple comparisons across four strategies and four metrics, weakening the statistical support for declaring one method best.
[Experimental Setup] Experimental Setup and Results: no retrieval baseline that omits RAG (or uses a different embedding model) is reported. While the paper focuses on relative chunking performance, the lack of such a reference makes it difficult to judge whether the observed metric differences translate to practically meaningful gains in Khmer agricultural retrieval.
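For reference, the paired t-test discussed in the first major comment compares two methods on the same units by testing whether the mean of their per-unit differences is zero. A minimal sketch of the statistic follows; the function name `paired_t` and the sample numbers are hypothetical, and whether the paper pairs over the 5 folds or the 18 questions is not stated in the material here.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1

# Obviously synthetic numbers, chosen only so the result is easy to check by hand.
t_stat, dof = paired_t([1, 2, 3, 4, 5], [0, 0, 0, 0, 0])
```

Turning the statistic into a p-value requires the Student t distribution with `dof` degrees of freedom; a library routine such as SciPy's `scipy.stats.ttest_rel` handles that step. With only five folds the test has four degrees of freedom, which is part of why the referee treats the result with caution.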

minor comments (2)

[Abstract] Abstract: the metric list includes both 'Khmer Coverage' and 'Khmer Intersection over Union'; the results paragraph reports only the latter. Clarify whether Coverage is a distinct metric or a reporting omission.
[Abstract] Notation: the abstract refers to 'Average Retrieval Score (L2 distance)' while the results use plain 'L2 distance'. Adopt consistent terminology throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to the manuscript.

Referee: [Evaluation] Evaluation section: the optimality conclusion for Recursive chunking rests on 5-fold CV over only 18 QA pairs (roughly 3–4 test instances per fold). With such limited data the reported means and standard deviations (e.g., L2 0.4295 ± 0.0461) are sensitive to pair selection; the single paired t-test does not correct for multiple comparisons across four strategies and four metrics, weakening the statistical support for declaring one method best.

Authors: We acknowledge that 18 QA pairs constitute a modest evaluation set, a limitation inherent to the scarcity of annotated Khmer agricultural data. These pairs represent the complete ground-truth collection available for this domain-specific task. Five-fold cross-validation was used to obtain more stable estimates from the limited data. We agree that the single paired t-test should account for multiple comparisons. In the revised manuscript we now apply a Bonferroni correction across the four metrics and report both raw and adjusted p-values. We have also added an explicit limitations subsection discussing sensitivity to pair selection and modest statistical power.

Revision: yes

Referee: [Experimental Setup] Experimental Setup and Results: no retrieval baseline that omits RAG (or uses a different embedding model) is reported. While the paper focuses on relative chunking performance, the lack of such a reference makes it difficult to judge whether the observed metric differences translate to practically meaningful gains in Khmer agricultural retrieval.

Authors: The manuscript’s stated goal is to compare chunking strategies inside a fixed RAG pipeline rather than to benchmark RAG against non-retrieval methods. To address the concern about practical significance, the revised version includes a non-RAG baseline in which the LLM generates answers directly from the full documents without retrieval. We report Answer Relevance for this baseline and show that the best RAG configuration (Recursive chunking at 300 characters) yields a statistically higher score. Retrieval-specific metrics (L2 distance, Khmer IoU) are noted as inapplicable to the non-RAG case. We have clarified in the text that all optimality claims are relative to other chunking approaches within RAG.

Revision: yes
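The Bonferroni correction mentioned in the first response multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below applies it to the one p-value the abstract reports (0.0121). The helper name `bonferroni_adjust` is hypothetical, and the test counts are assumptions: 4 follows the rebuttal's "across the four metrics", while 12 would cover Recursive against each of the three other strategies on four metrics.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: raw p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

# Reported raw p-value for Recursive vs. Sentence-Based on L2 distance.
raw_p = 0.0121
adjusted_metrics_only = bonferroni_adjust(raw_p, 4)        # about 0.0484
adjusted_all_comparisons = bonferroni_adjust(raw_p, 12)    # about 0.1452
```

Under the four-test assumption the result stays just below the usual 0.05 threshold; under the broader twelve-test assumption it does not, so the choice of test family matters for the paper's headline claim.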

Circularity Check

0 steps flagged

No circularity: direct empirical measurement on held-out QA pairs

The paper conducts an empirical comparison of four chunking strategies (Recursive, Khmer-Aware, Sentence-Based, LLM-Based) inside a RAG pipeline. Chunks are encoded with the fixed BGE-M3 model and retrieved via FAISS; performance is measured by L2 distance, Answer Relevance, Khmer Coverage, and Khmer IoU against 18 ground-truth QA pairs under 5-fold cross-validation. No equations, fitted parameters, self-referential predictions, or derivations appear; the reported best performer (Recursive, size 300) is simply the strategy that produced the best observed metric values on the held-out folds. The single paired t-test is a post-hoc statistical check on the same empirical results and does not create a circular loop. The conclusion therefore rests on direct measurement against the ground-truth pairs rather than on quantities the paper defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the small set of 18 QA pairs adequately represents the distribution of real user queries and that the four chosen metrics capture retrieval quality for morphologically rich low-resource text.

axioms (2)

Domain assumption: BGE-M3 produces useful embeddings for Khmer agricultural text without additional fine-tuning.
The model is used directly; no justification or ablation for its suitability to Khmer is provided in the abstract.

Domain assumption: The 18 ground-truth QA pairs are representative of typical agricultural queries in Khmer.
Evaluation and statistical claims depend on this small fixed set.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Paper passage: "We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347)."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Paper passage: "A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121)."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 6 internal anchors

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474.

[2] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu, “M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation,”

[3] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. [Online]. Available: https://arxiv.org/abs/2402.03216

[4] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with gpus,” 2017. [Online]. Available: https://arxiv.org/abs/1702.08734

[5] R. Qu, R. Tu, and F. Bao, “Is semantic chunking worth the computational cost?” 2024. [Online]. Available: https://arxiv.org/abs/2410.13070

[6] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.

[7] M. Günther, I. Mohr, D. J. Williams, B. Wang, and H. Xiao, “Late chunking: Contextual chunk embeddings using long-context embedding models,” 2025.

[8] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “Realm: Retrieval-augmented language model pre-training,” arXiv preprint arXiv:2002.08909, 2020.

[9] V. Chea, Y. Kyaw, C. Ding, M. Utiyama, A. Finch, and E. Sumita, “Khmer word segmentation using conditional random fields,” 2015.

[10] R. Buoy, N. Taing, and S. Kor, “Khmer word segmentation using bilstm networks,” 2020.

[11] S. Sry and A. Nguyen, “A review of khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory,” Ho Chi Minh City Open University Journal of Science: Engineering and Technology, vol. 12, pp. 23–34, 2022.

[12] O. Koshorek, A. Cohen, N. Mor, M. Rotman, and J. Berant, “Text segmentation as a supervised learning task,” 2018. [Online]. Available: https://arxiv.org/abs/1803.09337

[13] A. A. Alemi and P. Ginsparg, “Text segmentation based on semantic word embeddings,” arXiv preprint arXiv:1503.05543, 2015.

[14] C. Merola and J. Singh, “Reconstructing context: Evaluating advanced chunking strategies for retrieval-augmented generation,” 2025. [Online]. Available: https://arxiv.org/abs/2504.19754

[15] T. Vu, D. Q. Nguyen, D. Q. Nguyen, M. Dras, and M. Johnson, “Vncorenlp: A vietnamese natural language processing toolkit,” in Proceedings of NAACL-HLT Demonstrations. Association for Computational Linguistics, 2018, pp. 56–60.

[16] Y. Susanto, A. V. Hulagadri, J. R. Montalan, J. G. Ngui, X. B. Yong, W. Leong, H. Rengarajan, P. Limkonchotiwat, Y. Mai, and W. C. Tjhi, “Sea-helm: Southeast asian holistic evaluation of language models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.14301

[17] W. Q. Leong, J. G. Ngui, Y. Susanto, H. Rengarajan, K. Sarveswaran, and W. C. Tjhi, “Bhasa: A holistic southeast asian linguistic and cultural evaluation suite for large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2309.06085

[18] I. Strauss, J. Yang, T. O’Reilly, S. Rosenblat, and I. Moure, The Attribution Crisis in LLM Search Results: Estimating Ecosystem Exploitation, Jun. 2025. [Online]. Available: http://dx.doi.org/10.35650/AIDP.4114.d.2025

[19] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, and J. Lin, “Making a miracl: Multilingual information retrieval across a continuum of languages,” 2022. [Online]. Available: https://arxiv.org/abs/2210.09984

[20] A. F. Hidayatullah, R. Apong, D. T. C. Lai, and A. Qazi, “Pre-trained language model for code-mixed text in indonesian, javanese, and english using transformer,” Social Network Analysis and Mining, vol. 15, 2025.

[21] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” arXiv preprint arXiv:2307.03172, 2023.
