SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

Jian Gang Ngui; Peerat Limkonchotiwat; Raymond Ng; Sarana Nutanong

arxiv: 2606.03027 · v1 · pith:KFFHRESAnew · submitted 2026-06-02 · 💻 cs.CL



keywords: reproducible, data, embeddings, robust, SEA-Embedding, Southeast, text, Asian


Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages. We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.
