SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

Ekapol Chuangsuwanich; Erik Cambria; Jann Railey Montalan; Jian Gang Ngui; Panuthep Tasawong; Peerat Limkonchotiwat; Raymond Ng; Sarana Nutanong; Thura Aung; William Chandra Tjhi

arxiv: 2508.12243 · v3 · submitted 2025-08-17 · 💻 cs.CL

Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Raymond Ng, Jann Railey Montalan, Thura Aung, Jian Gang Ngui, Yosephine Susanto, William Chandra Tjhi, Panuthep Tasawong, Erik Cambria, Ekapol Chuangsuwanich, Sarana Nutanong

Pith reviewed 2026-05-18 22:33 UTC · model grok-4.3

keywords multilingual embeddings · Southeast Asian languages · benchmark · semantic representation · performance variation · language tasks · model evaluation

The pith

Multilingual embedding models do not encode meaning in a stable semantic space for Southeast Asian languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that multilingual embeddings create consistent, perspective-independent representations of meaning. It introduces SEA-BED, a benchmark that runs many embedding models on ten Southeast Asian languages across several different tasks. Results show that performance changes sharply depending on which language and which task are involved, and that good results on one combination do not carry over to others. No model achieves strong results everywhere. These patterns indicate that current embeddings contain real gaps in how they represent meaning for these languages.

Core claim

The central claim is that multilingual text embeddings do not produce stable similarity judgments across languages and tasks. Using the SEA-BED benchmark on ten Southeast Asian languages and diverse embedding tasks, the evaluations find that no single model performs uniformly well, that task difficulty varies markedly within each language, and that success on one task does not reliably predict success on others. Language-task analyses reveal highly non-uniform performance landscapes.
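The claim that success on one task does not reliably predict success on others can be checked with a rank correlation between two tasks across the same set of models. The sketch below is illustrative only: the scores are invented, the task names are placeholders, and Spearman correlation is one reasonable choice rather than necessarily the analysis the authors ran.

```python
# Hypothetical scores for four models on two tasks in a single language.
# None of these numbers come from the paper.
retrieval = [0.71, 0.64, 0.58, 0.80]
classification = [0.55, 0.69, 0.62, 0.51]


def ranks(values):
    """Return 1-based ranks (1 = smallest), averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


rho = spearman(retrieval, classification)
print(f"cross-task Spearman rho = {rho:.2f}")
```

A value near 1 would mean the model ranking carries over between the two tasks; a value near zero or below, as in this invented example, is the pattern the paper describes.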

What carries the argument

The SEA-BED benchmark, which runs systematic evaluations of embedding performance across ten languages and multiple tasks to map variations in semantic representation.

If this is right

Performance must be measured across many languages and tasks together rather than relying on single scores.
Model development should incorporate data, algorithmic, and architectural choices that address observed language-task gaps.
Reliable use of embeddings for Southeast Asian languages requires checking specific combinations instead of assuming broad stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same uneven representation patterns may appear in other language families that are poorly covered in training data.
Targeted collection of task-specific data for the weakest language-task pairs could improve consistency.
Future benchmarks could add more fine-grained language and task pairs to locate representation gaps more precisely.

Load-bearing premise

The selected ten languages and range of tasks are representative enough to reveal general inconsistencies in how multilingual embeddings represent meaning.

What would settle it

Finding one embedding model that achieves consistently high performance across all ten languages and all tasks in the SEA-BED evaluation would show the reported non-uniformity does not hold.
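As a rough sketch of that test, one could look at each model's weakest language-task cell and ask whether it clears a chosen bar. Everything here is an assumption made for illustration: the scores are invented, the model, language, and task names are placeholders, and the 0.70 threshold is arbitrary because the paper does not define "consistently high".

```python
# Hypothetical scores keyed by model, then by (language, task) cell.
scores = {
    "model_a": {
        ("lang_1", "retrieval"): 0.82, ("lang_1", "sts"): 0.41,
        ("lang_2", "retrieval"): 0.77, ("lang_2", "sts"): 0.74,
    },
    "model_b": {
        ("lang_1", "retrieval"): 0.69, ("lang_1", "sts"): 0.73,
        ("lang_2", "retrieval"): 0.38, ("lang_2", "sts"): 0.71,
    },
}


def weakest_cell(cells):
    """Return the (cell, score) pair with the lowest score for one model."""
    return min(cells.items(), key=lambda item: item[1])


def uniformly_strong(scores, threshold):
    """Return models whose weakest cell still reaches the threshold."""
    return [
        model for model, cells in scores.items()
        if weakest_cell(cells)[1] >= threshold
    ]


for model, cells in scores.items():
    cell, value = weakest_cell(cells)
    print(model, "weakest cell:", cell, value)

# An empty list means no model clears the bar on every cell.
print(uniformly_strong(scores, threshold=0.70))
```

Judging by the minimum rather than the mean is deliberate: an average can hide exactly the weak cells that the non-uniformity claim is about.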

Original abstract

Multilingual text embeddings are often assumed to encode meaning in a perspective-independent semantic space, yielding stable similarity judgments across tasks and languages. Our results show that this assumption does not hold in practice. We introduce SEA-BED, a large-scale benchmark covering 10 Southeast Asian (SEA) languages and diverse embedding tasks, designed to systematically examine how embedding performance varies across tasks, languages, and language-task combinations. Across extensive evaluations, we observe that no single model performs uniformly well across SEA languages; task difficulty differs markedly within languages, and success on one task does not reliably generalize to others. Language-task analyses further reveal highly non-uniform performance landscapes, where performance varies across different language-task combinations. These findings call for closer attention to performance measurements that provide an expansive view across languages and tasks to uncover inconsistencies in semantic representation. Based on these observations, we provide insights for future model development, including data, algorithmic, and architectural considerations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SEA-BED documents real performance variation across SEA languages and tasks, but the results look consistent with uneven pretraining data rather than a fundamental breakdown in shared semantic spaces.

The paper introduces SEA-BED, a benchmark covering 10 Southeast Asian languages and a range of embedding tasks. It reports that no single model does well everywhere, that task difficulty shifts inside each language, and that good results on one task do not predict results on another. That pattern is the main new observation, and the benchmark itself is the concrete contribution. Prior work on multilingual embeddings has left SEA languages thinly covered, so a systematic set of measurements focused here is useful on its face. The authors also give some forward-looking notes on data, algorithms, and architecture that follow from the patterns they see. Those parts are straightforward and worth having on record.

The soft spot is that the central interpretation still needs more support. The abstract and the stress-test note both point to the same issue: SEA languages differ sharply in pretraining exposure, script, and token counts. If the observed gaps track those differences, the findings fit ordinary data-scarcity effects and do not yet show that the models lack a perspective-independent semantic space. The paper would be stronger if it matched models on data volume, reported token statistics per language, or ran controls that isolate representation quality from resource level. Without those steps the claim that the assumption “does not hold in practice” rests on observational patterns rather than a direct test.

This work is mainly for groups building or evaluating multilingual embeddings who need coverage of SEA languages. The benchmark could become a standard reference point if the datasets and splits are released cleanly. It is coherent enough and addresses a real gap, so it deserves a serious referee rather than a desk reject. I would send it out for review with the expectation that the data-volume question gets addressed in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SEA-BED, a benchmark spanning 10 Southeast Asian languages and multiple embedding tasks. It evaluates multilingual embedding models and reports non-uniform performance: no model excels across all languages, task difficulty varies within languages, and success on one task does not reliably predict success on others. The authors conclude that the assumption of a perspective-independent semantic space does not hold for these languages and provide recommendations for future model development.

Significance. If the reported patterns are shown to be robust after controlling for data imbalances and with full methodological transparency, the work would be significant for multilingual NLP. SEA-BED supplies a new evaluation resource focused on underrepresented languages, and the language-task interaction analysis could usefully inform targeted data and architectural improvements for low-resource settings.

major comments (3)

[Section 3] Section 3 (Benchmark Construction): The description of SEA-BED dataset curation, task selection, and data sourcing is insufficiently detailed. Without explicit information on how examples were collected, balanced, or quality-controlled across the 10 languages, it is impossible to rule out confounds that could produce the observed non-uniformity.
[Section 4] Section 4 (Results): Performance figures and language-task matrices are presented without statistical significance tests, confidence intervals, or error bars. This omission makes it difficult to assess whether the reported variations across languages and tasks exceed what would be expected from sampling noise alone.
[Section 5] Section 5 (Discussion): The interpretation that non-uniform accuracy demonstrates representational inconsistency does not address the plausible alternative that gaps track differences in pretraining data volume or token coverage for each SEA language. A post-hoc correlation between model performance and estimated pretraining exposure per language would be required to support the stronger claim.

minor comments (2)

[Abstract] Abstract: The phrase 'extensive evaluations' should be accompanied by the exact number of models and tasks evaluated to give readers an immediate sense of scale.
[Introduction] Notation: Define all acronyms (e.g., SEA-BED) at first use in the main text even if already expanded in the abstract.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript introducing the SEA-BED benchmark. We have carefully reviewed each major comment and provide point-by-point responses below, indicating planned revisions to improve clarity, rigor, and transparency.

Referee: [Section 3] Section 3 (Benchmark Construction): The description of SEA-BED dataset curation, task selection, and data sourcing is insufficiently detailed. Without explicit information on how examples were collected, balanced, or quality-controlled across the 10 languages, it is impossible to rule out confounds that could produce the observed non-uniformity.

Authors: We agree that greater methodological transparency is needed. In the revised manuscript, we will substantially expand Section 3 to include: (1) detailed sourcing information for each task and language (including original data providers and any translation or adaptation steps); (2) explicit balancing criteria used to ensure comparable example counts and difficulty distributions across the 10 languages; and (3) quality-control procedures, such as automated filtering rules, human verification protocols, and inter-annotator agreement metrics where applicable. These additions will allow readers to better assess potential confounds.

revision: yes

Referee: [Section 4] Section 4 (Results): Performance figures and language-task matrices are presented without statistical significance tests, confidence intervals, or error bars. This omission makes it difficult to assess whether the reported variations across languages and tasks exceed what would be expected from sampling noise alone.

Authors: We acknowledge this limitation in statistical reporting. We will update all performance tables and figures in Section 4 to include 95% confidence intervals (computed via bootstrapping) and error bars. In addition, we will report results of statistical tests (e.g., Friedman tests for overall language and task effects followed by post-hoc Wilcoxon signed-rank tests with Bonferroni correction) to evaluate whether observed differences across languages and tasks are statistically significant beyond sampling variability.

revision: yes

Referee: [Section 5] Section 5 (Discussion): The interpretation that non-uniform accuracy demonstrates representational inconsistency does not address the plausible alternative that gaps track differences in pretraining data volume or token coverage for each SEA language. A post-hoc correlation between model performance and estimated pretraining exposure per language would be required to support the stronger claim.

Authors: Our primary claim concerns the empirical observation of non-uniform performance across languages and tasks, which already challenges the assumption of a perspective-independent semantic space. We agree, however, that discussing alternative explanations strengthens the paper. In the revision we will add a dedicated paragraph in Section 5 that (a) acknowledges pretraining data imbalance as a plausible contributing factor and (b) presents a post-hoc analysis correlating model performance with available proxies for language-specific pretraining exposure (e.g., cited web-crawl statistics and model documentation). Because exact token counts for proprietary models are not publicly disclosed, the analysis will necessarily rely on these proxies and will be presented as suggestive rather than definitive.

revision: partial
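The bootstrapped 95% confidence intervals promised in the response to the Section 4 comment can be sketched as a percentile bootstrap over per-example outcomes. The data, the resample count, and the function name `bootstrap_ci` are assumptions made for illustration, and the sketch covers only the interval, not the Friedman or Wilcoxon tests.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `outcomes`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-example correctness (1 = correct) for one language-task cell.
outcomes = [1] * 70 + [0] * 30
point_estimate = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"accuracy = {point_estimate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Two cells whose intervals overlap heavily give little evidence of a real difference, which is the sampling-noise concern the referee raises.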

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential steps

The paper introduces SEA-BED as a new benchmark dataset and reports direct performance measurements of existing embedding models across 10 languages and multiple tasks. All claims rest on observed accuracy numbers, language-task interaction patterns, and qualitative summaries of those measurements. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes are defined or invoked; the central finding (non-uniform performance) is a direct empirical observation rather than a reduction of any prior result to itself. Self-citations, if present in the full text, are not load-bearing for the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmarking paper that relies on standard NLP evaluation practices without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)

domain assumption: Standard embedding evaluation tasks (similarity, retrieval, classification) measure semantic representation quality.
Benchmark design depends on this to interpret performance differences as evidence about semantic spaces.
