SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

· 2025 · cs.CL · arXiv 2508.12243

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Multilingual text embeddings are often assumed to encode meaning in a perspective-independent semantic space, yielding stable similarity judgments across tasks and languages. Our results show that this assumption does not hold in practice. We introduce SEA-BED, a large-scale benchmark covering 10 Southeast Asian (SEA) languages and diverse embedding tasks, designed to systematically examine how embedding performance varies across tasks, languages, and language-task combinations. Across extensive evaluations, we observe that no single model performs uniformly well across SEA languages; task difficulty differs markedly within languages, and success on one task does not reliably generalize to others. Language-task analyses further reveal highly non-uniform performance landscapes, where performance varies across different language-task combinations. These findings call for closer attention to performance measurements that provide an expansive view across languages and tasks to uncover inconsistencies in semantic representation. Based on these observations, we provide insights for future model development, including data, algorithmic, and architectural considerations.

representative citing papers

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.

citing papers explorer

Showing 1 of 1 citing paper after filters.

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia cs.CL · 2026-06-02 · unverdicted · none · ref 38 · internal anchor
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.

SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

fields

years

verdicts

representative citing papers

citing papers explorer