SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

Aulia Adila; Kittiphat Leesombatwathana; My Chiffon Nguyen; Patomporn Payoungkhamdee; Saksorn Ruangtanusak; Samuel Cahyawijaya; Vissuta Gunawan Lim

arxiv: 2606.28715 · v1 · pith:VSPQGJLZnew · submitted 2026-06-27 · 💻 cs.CL · cs.AI

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

My Chiffon Nguyen , Aulia Adila , Saksorn Ruangtanusak , Kittiphat Leesombatwathana , Vissuta Gunawan Lim , Patomporn Payoungkhamdee , Samuel Cahyawijaya This is my paper

Pith reviewed 2026-06-30 10:10 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords SEATauBenchTauBench adaptationmultilingual agentsSoutheast Asian languagestool-using agentsagent evaluationlanguage transferdomain localization

0 comments

The pith

Adapting TauBench to five Southeast Asian languages shows agent performance holds for conversation-language changes but drops sharply with localized tools and domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SEATauBench by adapting TauBench into Mandarin, Vietnamese, Thai, Indonesian, and Filipino, then runs the same agents across three increasing levels of localization: conversation language only, language plus tool specifications, and full language plus tools plus task domains. It reports that English-trained agent capabilities transfer with little loss when only the user conversation language changes, yet both quality and robustness fall as more elements localize, with the steepest declines under complete domain adaptation. This pattern indicates that English-only agent tests miss real gaps in regional-language settings and supplies a reusable pipeline for diagnosing and improving multilingual agent reliability in linguistically diverse areas.

Core claim

SEATauBench adapts TauBench to five SEA languages and evaluates agents in progressively localized conditions that change the language of user-agent interaction, the tool specifications, and the underlying task domains. Across three recent models the results show English agent capabilities transfer reasonably well when only the conversation language is altered, but quality and robustness degrade sharply once tool specifications and task domains are also localized, with the largest losses appearing under full domain adaptation. The work therefore demonstrates the limits of English-only assessment for measuring agent capabilities in SEA languages and supplies both a diagnostic benchmark and an

What carries the argument

The progressive localization pipeline of SEATauBench, which independently varies conversation language, tool specifications, and task domains while keeping the underlying agent evaluation structure fixed.

If this is right

Agent robustness remains high when only the conversation language changes.
Performance and robustness losses increase steadily as tool specifications are localized.
The largest performance drops occur under full domain adaptation.
English-only evaluations cannot reliably measure agent capabilities in the target SEA languages.
The adaptation pipeline can be reused to create similar benchmarks for other languages or regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same localization pattern may appear in other low-resource language families outside Southeast Asia.
Models fine-tuned on localized domain data might reduce the observed drops in full-adaptation settings.
Native-speaker review of the adapted tasks could be added as a standard validation step in future extensions of the benchmark.

Load-bearing premise

The localized task versions preserve the original difficulty, tool semantics, and evaluation validity without introducing translation or domain-shift artifacts.

What would settle it

Re-running the full set of localized tasks with native-speaker validation and finding materially different difficulty ratings or altered tool behaviors relative to the English originals would undermine the measured performance gaps.

Figures

Figures reproduced from arXiv: 2606.28715 by Aulia Adila, Kittiphat Leesombatwathana, My Chiffon Nguyen, Patomporn Payoungkhamdee, Saksorn Ruangtanusak, Samuel Cahyawijaya, Vissuta Gunawan Lim.

**Figure 2.** Figure 2: Overview of the automated translation pipeline for generating multilingual [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Evaluation results on SEATauBench. SEATauBench reveals consistent degradation as the setting becomes [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Quality and robustness of varying agent models across languages and domains. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Error categorization for simulated user and agent, across SEA language in the L2 Interaction and L2 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Pass@1 and ρ 3 as the number of languages used for translated tools increase from 1 (English) to 5. Performance for two agents (GPT-5-Mini & Qwen3- 235B-A22B-Inst) slightly drops when a second language is added, but it plateaus for more languages. both analyzed scenarios. Our main finding that agent performance drops as L2 adaptation increases across four scenarios ( [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗

**Figure 7.** Figure 7: Only 1.4% of the variation in task performance [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Correlation of pass@1 (upper triangule) and [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Average turn-level agent language correctness [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Proportion of non-L2 turns as a function of [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Per-task share of agent turns emitted in En [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

read the original abstract

While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce SEATauBench, the first agent-focused evaluation framework for SEA sovereign AI. SeaTau adapts TauBench to five languages -- Mandarin, Vietnamese, Thai, Indonesian, and Filipino -- and evaluates agents across progressively localized settings that vary the language of user-agent interaction, tool specifications, and task domains. Across three recent models, we find that English agent capabilities transfer reasonably well when only the conversation language changes, but quality and robustness degrade sharply as more task contexts are localized, with the largest losses in full domain adaptation. We also the limits of English-only agent assessment for measuring agent capabilities in SEA languages. More broadly, SeaTau provides a diagnostic benchmark and reusable adaptation pipeline for building reliable multilingual agents for linguistically diverse regions. Data and code can be accessed at github.com/SEACrowd/SEATauBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SEATauBench gives a usable first look at how agent performance drops with deeper localization in five SEA languages, backed by a public adaptation pipeline.

read the letter

The main takeaway is that English agent skills hold when only the conversation language switches but fall off once tool specs and task domains get localized too. The paper adapts TauBench across Mandarin, Vietnamese, Thai, Indonesian, and Filipino in three progressive settings and reports the pattern across three recent models.

What is new is the benchmark itself plus the explicit multi-stage localization pipeline with human validation on tool semantics and domain fit. Releasing the code and data on GitHub is a clear plus; it lets others check or extend the work without starting from scratch.

The central claim rests on the localized tasks keeping roughly the same difficulty and meaning. The methods section apparently includes human review steps for exactly that, which reduces the usual translation-artifact worry. Still, the abstract gives no task counts, exact metrics, or statistical tests, so the size of the reported degradation is hard to judge from the summary alone.

This is aimed at people building or evaluating agents for linguistically diverse regions. Anyone studying transfer or sovereign AI setups could get value from the diagnostic framing. The work shows clear thinking on the evaluation design and engages the gap directly.

I would send it for peer review. The empirical setup is reproducible enough to merit referee time even if the numbers need tightening.

Referee Report

0 major / 1 minor

Summary. The manuscript introduces SEATauBench, the first agent-focused evaluation framework for Southeast Asian languages, by adapting TauBench to Mandarin, Vietnamese, Thai, Indonesian, and Filipino. Agents are evaluated across progressively localized settings that vary conversation language, tool specifications, and task domains. Across three recent models, the central claim is that English agent capabilities transfer reasonably well when only the conversation language changes, but quality and robustness degrade sharply as more task contexts are localized, with the largest losses under full domain adaptation. The work also highlights limits of English-only assessment for SEA languages and releases a diagnostic benchmark plus reusable adaptation pipeline, with data and code publicly available.

Significance. If the empirical patterns hold, the work is significant for sovereign AI efforts in linguistically diverse regions by providing concrete evidence that English-centric agent evaluations are insufficient for SEA languages and by supplying a reusable multi-stage adaptation pipeline with reported human validation. The public release of data and code strengthens reproducibility and enables follow-on work.

minor comments (1)

[Abstract] Abstract: the sentence beginning 'We also the limits of English-only agent assessment...' is missing a verb and should be revised for grammatical completeness (e.g., 'We also demonstrate the limits...').

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of SEATauBench and for recommending minor revision. The provided summary correctly reflects our central findings on language transfer versus domain localization in agent evaluations for Southeast Asian languages. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper performs direct empirical evaluation by adapting TauBench into SEATauBench via an explicit multi-stage localization pipeline (language, tool specs, domains) with reported human validation. Central claims are observational results on model performance degradation across localization levels, not derived predictions, fitted parameters renamed as outputs, or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, axioms, or invented entities; the contribution is an empirical benchmark adaptation and evaluation pipeline.

pith-pipeline@v0.9.1-grok · 5745 in / 935 out tokens · 42136 ms · 2026-06-30T10:10:31.840158+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 27 canonical work pages · 4 internal anchors

[1]

2026 , eprint=

Kimi K2.5: Visual Agentic Intelligence , author=. 2026 , eprint=

2026
[2]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026
[3]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[4]

Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLM s

Chae, Kyubyung and Kim, Gihoon and Lee, Gyuseong and Kim, Taesup and Lee, Jaejin and Kim, Heejin. Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLM s. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.559

work page doi:10.18653/v1/2025.findings-emnlp.559 2025
[5]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

2024 , eprint=

-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. 2024 , eprint=

2024
[9]

2026 , url=

Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta , booktitle=. 2026 , url=

2026
[10]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li and Yingxiu Zhao and Bowen Yu and Feifan Song and Hangyu Li and Haiyang Yu and Zhoujun Li and Fei Huang and Yongbin Li , year=. 2304.08244 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil and Tianjun Zhang and Xin Wang and Joseph E. Gonzalez , year=. Gorilla: Large Language Model Connected with Massive. 2305.15334 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Patil and Huanzhi Mao and Charlie Cheng-Jie Ji and Fanjia Yan and Vishnu Suresh and Ion Stoica and Joseph E

Shishir G. Patil and Huanzhi Mao and Charlie Cheng-Jie Ji and Fanjia Yan and Vishnu Suresh and Ion Stoica and Joseph E. Gonzalez , booktitle=. The Berkeley Function Calling Leaderboard (. 2025 , url=

2025
[13]

2025 , url=

Jiarui Lu and Thomas Holleis and Yizhe Zhang and Bernhard Aumayer and Feng Nan and Haoping Bai and Shuang Ma and Shen Ma and Mengyu Li and Guoli Yin and Zirui Wang and Ruoming Pang , booktitle=. 2025 , url=

2025
[14]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , year=. 2307.13854 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried , year=. 2401.13649 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026
[17]

2024 , url=

Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , booktitle=. 2024 , url=

2024
[18]

Windows agent arena: Evaluating multi-modal os agents at scale, 2024

Rogerio Bonatti and Dan Zhao and Francesco Bonacci and Dillon Dupont and Sara Abdali and Yinheng Li and Justin Wagle and Kazuhito Koishida and Arthur Bucker and Lawrence Jang and Zack Hui , year=. Windows Agent Arena: Evaluating Multi-Modal. 2409.08264 , archivePrefix=

work page arXiv
[19]

2502.14301 , archivePrefix=

Yosephine Susanto and Adithya Venkatadri Hulagadri and Jann Railey Montalan and Jian Gang Ngui and Xian Bin Yong and Weiqi Leong and Hamsawardhini Rengarajan and Peerat Limkonchotiwat and Yifan Mai and William Chandra Tjhi , year=. 2502.14301 , archivePrefix=

work page arXiv
[20]

2502.06298 , archivePrefix=

Chaoqun Liu and Wenxuan Zhang and Jiahao Ying and Mahani Aljunied and Anh Tuan Luu and Lidong Bing , year=. 2502.06298 , archivePrefix=

work page arXiv
[21]

2021 , url=

Li, Haoran and Arora, Abhinav and Chen, Shuohui and Gupta, Anchit and Gupta, Sonal and Mehdad, Yashar , booktitle=. 2021 , url=

2021
[22]

2023 , url=

Moradshahi, Mehrad and Shen, Tianhao and Bali, Kalika and Choudhury, Monojit and de Chalendar, Gael and Goel, Anmol and Kim, Sungkyun and Kodali, Prashant and Kumaraguru, Ponnurangam and Semmar, Nasredine and Semnani, Sina and Seo, Jiwon and Seshadri, Vivek and Shrivastava, Manish and Sun, Michael and Yadavalli, Aditya and You, Chaobin and Xiong, Deyi and...

2023
[23]

2025 , url=

Kulkarni, Mayank and Mazzia, Vittorio and Gaspers, Judith and Hench, Chris and FitzGerald, Jack , booktitle=. 2025 , url=

2025
[24]

2508.12459 , archivePrefix=

Alham Fikri Aji and Trevor Cohn , year=. 2508.12459 , archivePrefix=

work page arXiv
[25]

2026 , eprint=

-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge , author=. 2026 , eprint=

2026
[26]

M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

Budzianowski, Pawe and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I \ n igo and Ultes, Stefan and Ramadan, Osman and Ga s i \'c , Milica. M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/...

work page doi:10.18653/v1/d18-1547 2018
[27]

M ulti WOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines

Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyang and Kumar, Adarsh and Goyal, Anuj and Ku, Peter and Hakkani-Tur, Dilek. M ulti WOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. Proceedings of the Twelfth Language Resources and Evaluation Confer...

2020
[28]

M ulti WOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines

Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong. M ulti WOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. 2020. doi:10.18653/v1/2020.nlp4convai-1.13

work page doi:10.18653/v1/2020.nlp4convai-1.13 2020
[29]

MultiWOZ 2.3: A Multi-domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation

Han, Ting and Liu, Ximing and Takanabu, Ryuichi and Lian, Yixin and Huang, Chongxuan and Wan, Dazhen and Peng, Wei and Huang, Minlie. MultiWOZ 2.3: A Multi-domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation. Natural Language Processing and Chinese Computing. 2021

2021
[30]

M ulti WOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation

Ye, Fanghua and Manotumruksa, Jarana and Yilmaz, Emine. M ulti WOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2022. doi:10.18653/v1/2022.sigdial-1.34

work page doi:10.18653/v1/2022.sigdial-1.34 2022
[31]

MASSIVE : A 1 M -Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

FitzGerald, Jack and Hench, Christopher and Peris, Charith and Mackie, Scott and Rottmann, Kay and Sanchez, Ana and Nash, Aaron and Urbach, Liam and Kakarala, Vishesh and Singh, Richa and Ranganath, Swetha and Crist, Laurie and Britan, Misha and Leeuwis, Wouter and Tur, Gokhan and Natarajan, Prem. MASSIVE : A 1 M -Example Multilingual Natural Language Und...

work page doi:10.18653/v1/2023.acl-long.235 2023
[32]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , url =

Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Hong, Lauren and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and li, dahai and Liu, Zhiyuan and Sun, Maosong , booktitle =. ToolLLM: Facilitating Large Language Models t...
[33]

CRMA rena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Huang, Kung-Hsiang and Prabhakar, Akshara and Dhawan, Sidharth and Mao, Yixin and Wang, Huan and Savarese, Silvio and Xiong, Caiming and Laban, Philippe and Wu, Chien-Sheng. CRMA rena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. Proceedings of the 2025 Conference of the Nations of the Americas Chap...

work page doi:10.18653/v1/2025.naacl-long.194 2025
[34]

Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Zhiruo Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Melroy Maben and Raj Mehta and Wayne Chi and Lawrence Keunho Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig , bookti...

2026
[35]

2024 , eprint=

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? , author=. 2024 , eprint=

2024
[36]

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks , url =

Boisvert, L\'. WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks , url =. Advances in Neural Information Processing Systems , doi =
[37]

Kim and Samuel Miserendino and Gildas Chabot and David Li and Patrick Chao and Michael Sharman and Alexandra Barr and Amelia Glaese and Jerry Tworek , booktitle=

Tejal Patwardhan and Rachel Dias and Elizabeth Proehl and Grace Kim and Michele Wang and Olivia Watkins and Simon Posada Fishman and Marwan Aljubeh and Phoebe Thacker and Laurance Fauconnet and Natalie S. Kim and Samuel Miserendino and Gildas Chabot and David Li and Patrick Chao and Michael Sharman and Alexandra Barr and Amelia Glaese and Jerry Tworek , b...

2026
[38]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
[39]

Global MMLU : Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

work page doi:10.18653/v1/2025.acl-long.919 2025
[40]

G lot E val: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Luo, Hengyu and Li, Zihao and Attieh, Joseph and Devkota, Sawal and de Gibert, Ona and Huang, Xu and Ji, Shaoxiong and Lin, Peiqin and Mantina, Bhavani Sai Praneeth Varma and Sreenidhi, Ananda and V. G lot E val: A Test Suite for Massively Multilingual Evaluation of Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural ...

work page doi:10.18653/v1/2025.emnlp-demos.43 2025
[41]

2024 , eprint=

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark , author=. 2024 , eprint=

2024
[42]

2025 , howpublished =

2025
[43]

N usa X : Multilingual Parallel Sentiment Dataset for 10 I ndonesian Local Languages

Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian. N usa X : Multilingual Parallel Sentiment Dataset for 10 I ndonesian Loca...

work page doi:10.18653/v1/2023.eacl-main.57 2023
[44]

N usa W rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Cahyawijaya, Samuel and Lovenia, Holy and Koto, Fajri and Adhista, Dea and Dave, Emmanuel and Oktavianti, Sarah and Akbar, Salsabil and Lee, Jhonson and Shadieq, Nuur and Cenggoro, Tjeng Wawan and Linuwih, Hanung and Wilie, Bryan and Muridan, Galih and Winata, Genta and Moeljadi, David and Aji, Alham Fikri and Purwarianti, Ayu and Fung, Pascale. N usa W r...

work page doi:10.18653/v1/2023.ijcnlp-main.60 2023
[45]

N usa D ialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages

Purwarianti, Ayu and Adhista, Dea and Baptiso, Agung and Mahfuzh, Miftahul and Sabila, Yusrina and Adila, Aulia and Cahyawijaya, Samuel and Aji, Alham Fikri. N usa D ialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages. Proceedings of the Second Workshop in South East Asian Language Processing. 2025

2025
[46]

I ndo T o D : A Multi-Domain I ndonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Kautsar, Muhammad and Nurdini, Rahmah and Cahyawijaya, Samuel and Winata, Genta and Purwarianti, Ayu. I ndo T o D : A Multi-Domain I ndonesian Benchmark For End-to-End Task-Oriented Dialogue Systems. Proceedings of the First Workshop in South East Asian Language Processing. 2023. doi:10.18653/v1/2023.sealp-1.7

work page doi:10.18653/v1/2023.sealp-1.7 2023
[47]

and Santoso, Jennifer and Aco, Elyanah and Fadhilah, Akhdan and Mansurov, Jonibek and Imperial, Joseph Marvin and Kampman, Onno P

Lovenia, Holy and Mahendra, Rahmad and Akbar, Salsabil Maulana and Miranda, Lester James V. and Santoso, Jennifer and Aco, Elyanah and Fadhilah, Akhdan and Mansurov, Jonibek and Imperial, Joseph Marvin and Kampman, Onno P. and Moniz, Joel Ruben Antony and Habibi, Muhammad Ravi Shulthan and Hudi, Frederikus and Montalan, Railey and Ignatius, Ryan and Lopo,...

work page doi:10.18653/v1/2024.emnlp-main.296 2024
[48]

Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses

Cahyawijaya, Samuel and Zhang, Ruochen and Cruz, Jan Christian Blaise and Lovenia, Holy and Gilbert, Elisa and Nomoto, Hiroki and Aji, Alham Fikri. Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.f...

work page doi:10.18653/v1/2025.findings-naacl.178 2025
[49]

S ea E xam and S ea B ench: Benchmarking LLM s with Local Multilingual Questions in S outheast A sia

Liu, Chaoqun and Zhang, Wenxuan and Ying, Jiahao and Aljunied, Mahani and Luu, Anh Tuan and Bing, Lidong. S ea E xam and S ea B ench: Benchmarking LLM s with Local Multilingual Questions in S outheast A sia. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.341

work page doi:10.18653/v1/2025.findings-naacl.341 2025
[50]

SEA - HELM : S outheast A sian Holistic Evaluation of Language Models

Susanto, Yosephine and Hulagadri, Adithya Venkatadri and Montalan, Jann Railey and Ngui, Jian Gang and Yong, Xianbin and Leong, Wei Qi and Rengarajan, Hamsawardhini and Limkonchotiwat, Peerat and Mai, Yifan and Tjhi, William Chandra. SEA - HELM : S outheast A sian Holistic Evaluation of Language Models. Findings of the Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.636 2025
[51]

N usa C rowd: Open Source Initiative for I ndonesian NLP Resources

Cahyawijaya, Samuel and Lovenia, Holy and Aji, Alham Fikri and Winata, Genta and Wilie, Bryan and Koto, Fajri and Mahendra, Rahmad and Wibisono, Christian and Romadhony, Ade and Vincentio, Karissa and Santoso, Jennifer and Moeljadi, David and Wirawan, Cahya and Hudi, Frederikus and Wicaksono, Muhammad Satrio and Parmonangan, Ivan and Alfina, Ika and Putra...

work page doi:10.18653/v1/2023.findings-acl.868 2023
[52]

and Han, Kenneth Chen Ko and Santos, Anjela Gail D

Cahyawijaya, Samuel and Lovenia, Holy and Moniz, Joel Ruben Antony and Wong, Tack Hwa and Farhansyah, Mohammad Rifqi and Maung, Thant Thiri and Hudi, Frederikus and Anugraha, David and Habibi, Muhammad Ravi Shulthan and Qorib, Muhammad Reza and Agarwal, Amit and Imperial, Joseph Marvin and Patel, Hitesh Laxmichand and Feliren, Vicky and Nasution, Bahrul I...

work page doi:10.18653/v1/2025.acl-long.916 2025
[53]

2026 , eprint=

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model , author=. 2026 , eprint=

2026
[54]

SEA - VQA : S outheast A sian Cultural Context Dataset For Visual Question Answering

Urailertprasert, Norawit and Limkonchotiwat, Peerat and Suwajanakorn, Supasorn and Nutanong, Sarana. SEA - VQA : S outheast A sian Cultural Context Dataset For Visual Question Answering. Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR). 2024. doi:10.18653/v1/2024.alvr-1.15

work page doi:10.18653/v1/2024.alvr-1.15 2024
[55]

2024 , howpublished =

2024
[56]

2025 , eprint=

^2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. 2025 , eprint=

2025
[57]

2026 , eprint=

-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains , author=. 2026 , eprint=

2026
[58]

Position: The Right to

Rashid Mushkani and Hugo Berard and Allison Cohen and Shin Koseki , booktitle=. Position: The Right to. 2025 , url=

2025
[59]

2026 , url =

Barasa, Hilda and Tay, PeiChin and McBride, Keegan and Iosad, Alexander and M. 2026 , url =

2026
[60]

Sovereignty in the Age of AI: Strategic Choices, Structural Dependencies , month = jan, year =
[61]

2026 , month = apr, day =

Bhandari, Manik and Modi, Gaurav , title =. 2026 , month = apr, day =

2026
[62]

Why Sovereign Artificial Intelligence Is Imperative in Southeast Asia , year =
[63]

Frontiers in Artificial Intelligence , VOLUME=

Putra, Bama Andika , TITLE=. Frontiers in Artificial Intelligence , VOLUME=. 2024 , URL=. doi:10.3389/frai.2024.1411838 , ISSN=

work page doi:10.3389/frai.2024.1411838 2024
[64]

2024 , url =

Wenxuan Zhang and Hou Pong Chan and Yiran Zhao and Mahani Aljunied and Jianyu Wang and Chaoqun Liu and Yue Deng and Zhiqiang Hu and Weiwen Xu and Yew Ken Chia and Xin Li and Lidong Bing , title =. 2024 , url =

2024
[65]

2024 , booktitle =

Xuan-Phi Nguyen and Wenxuan Zhang and Xin Li and Mahani Aljunied and Zhiqiang Hu and Chenhui Shen and Yew Ken Chia and Xingxuan Li and Jianyu Wang, Qingyu Tan and Liying Cheng and Guanzheng Chen and Yue Deng and Sen Yang and Chaoqun Liu and Hang Zhang and Lidong Bing , title =. 2024 , booktitle =

2024
[66]

2026 , eprint=

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations , author=. 2026 , eprint=

2026

[1] [1]

2026 , eprint=

Kimi K2.5: Visual Agentic Intelligence , author=. 2026 , eprint=

2026

[2] [2]

2026 , eprint=

OpenAI GPT-5 System Card , author=. 2026 , eprint=

2026

[3] [3]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[4] [4]

Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLM s

Chae, Kyubyung and Kim, Gihoon and Lee, Gyuseong and Kim, Taesup and Lee, Jaejin and Kim, Heejin. Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLM s. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.559

work page doi:10.18653/v1/2025.findings-emnlp.559 2025

[5] [5]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

2024 , eprint=

-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. 2024 , eprint=

2024

[9] [9]

2026 , url=

Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta , booktitle=. 2026 , url=

2026

[10] [10]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li and Yingxiu Zhao and Bowen Yu and Feifan Song and Hangyu Li and Haiyang Yu and Zhoujun Li and Fei Huang and Yongbin Li , year=. 2304.08244 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil and Tianjun Zhang and Xin Wang and Joseph E. Gonzalez , year=. Gorilla: Large Language Model Connected with Massive. 2305.15334 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Patil and Huanzhi Mao and Charlie Cheng-Jie Ji and Fanjia Yan and Vishnu Suresh and Ion Stoica and Joseph E

Shishir G. Patil and Huanzhi Mao and Charlie Cheng-Jie Ji and Fanjia Yan and Vishnu Suresh and Ion Stoica and Joseph E. Gonzalez , booktitle=. The Berkeley Function Calling Leaderboard (. 2025 , url=

2025

[13] [13]

2025 , url=

Jiarui Lu and Thomas Holleis and Yizhe Zhang and Bernhard Aumayer and Feng Nan and Haoping Bai and Shuang Ma and Shen Ma and Mengyu Li and Guoli Yin and Zirui Wang and Ruoming Pang , booktitle=. 2025 , url=

2025

[14] [14]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , year=. 2307.13854 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po-Yu Huang and Graham Neubig and Shuyan Zhou and Ruslan Salakhutdinov and Daniel Fried , year=. 2401.13649 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026

[17] [17]

2024 , url=

Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , booktitle=. 2024 , url=

2024

[18] [18]

Windows agent arena: Evaluating multi-modal os agents at scale, 2024

Rogerio Bonatti and Dan Zhao and Francesco Bonacci and Dillon Dupont and Sara Abdali and Yinheng Li and Justin Wagle and Kazuhito Koishida and Arthur Bucker and Lawrence Jang and Zack Hui , year=. Windows Agent Arena: Evaluating Multi-Modal. 2409.08264 , archivePrefix=

work page arXiv

[19] [19]

2502.14301 , archivePrefix=

Yosephine Susanto and Adithya Venkatadri Hulagadri and Jann Railey Montalan and Jian Gang Ngui and Xian Bin Yong and Weiqi Leong and Hamsawardhini Rengarajan and Peerat Limkonchotiwat and Yifan Mai and William Chandra Tjhi , year=. 2502.14301 , archivePrefix=

work page arXiv

[20] [20]

2502.06298 , archivePrefix=

Chaoqun Liu and Wenxuan Zhang and Jiahao Ying and Mahani Aljunied and Anh Tuan Luu and Lidong Bing , year=. 2502.06298 , archivePrefix=

work page arXiv

[21] [21]

2021 , url=

Li, Haoran and Arora, Abhinav and Chen, Shuohui and Gupta, Anchit and Gupta, Sonal and Mehdad, Yashar , booktitle=. 2021 , url=

2021

[22] [22]

2023 , url=

Moradshahi, Mehrad and Shen, Tianhao and Bali, Kalika and Choudhury, Monojit and de Chalendar, Gael and Goel, Anmol and Kim, Sungkyun and Kodali, Prashant and Kumaraguru, Ponnurangam and Semmar, Nasredine and Semnani, Sina and Seo, Jiwon and Seshadri, Vivek and Shrivastava, Manish and Sun, Michael and Yadavalli, Aditya and You, Chaobin and Xiong, Deyi and...

2023

[23] [23]

2025 , url=

Kulkarni, Mayank and Mazzia, Vittorio and Gaspers, Judith and Hench, Chris and FitzGerald, Jack , booktitle=. 2025 , url=

2025

[24] [24]

2508.12459 , archivePrefix=

Alham Fikri Aji and Trevor Cohn , year=. 2508.12459 , archivePrefix=

work page arXiv

[25] [25]

2026 , eprint=

-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge , author=. 2026 , eprint=

2026

[26] [26]

M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

Budzianowski, Pawe and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I \ n igo and Ultes, Stefan and Ramadan, Osman and Ga s i \'c , Milica. M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/...

work page doi:10.18653/v1/d18-1547 2018

[27] [27]

M ulti WOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines

Eric, Mihail and Goel, Rahul and Paul, Shachi and Sethi, Abhishek and Agarwal, Sanchit and Gao, Shuyang and Kumar, Adarsh and Goyal, Anuj and Ku, Peter and Hakkani-Tur, Dilek. M ulti WOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. Proceedings of the Twelfth Language Resources and Evaluation Confer...

2020

[28] [28]

M ulti WOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines

Zang, Xiaoxue and Rastogi, Abhinav and Sunkara, Srinivas and Gupta, Raghav and Zhang, Jianguo and Chen, Jindong. M ulti WOZ 2.2 : A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. 2020. doi:10.18653/v1/2020.nlp4convai-1.13

work page doi:10.18653/v1/2020.nlp4convai-1.13 2020

[29] [29]

MultiWOZ 2.3: A Multi-domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation

Han, Ting and Liu, Ximing and Takanabu, Ryuichi and Lian, Yixin and Huang, Chongxuan and Wan, Dazhen and Peng, Wei and Huang, Minlie. MultiWOZ 2.3: A Multi-domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation. Natural Language Processing and Chinese Computing. 2021

2021

[30] [30]

M ulti WOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation

Ye, Fanghua and Manotumruksa, Jarana and Yilmaz, Emine. M ulti WOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation. Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2022. doi:10.18653/v1/2022.sigdial-1.34

work page doi:10.18653/v1/2022.sigdial-1.34 2022

[31] [31]

MASSIVE : A 1 M -Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

FitzGerald, Jack and Hench, Christopher and Peris, Charith and Mackie, Scott and Rottmann, Kay and Sanchez, Ana and Nash, Aaron and Urbach, Liam and Kakarala, Vishesh and Singh, Richa and Ranganath, Swetha and Crist, Laurie and Britan, Misha and Leeuwis, Wouter and Tur, Gokhan and Natarajan, Prem. MASSIVE : A 1 M -Example Multilingual Natural Language Und...

work page doi:10.18653/v1/2023.acl-long.235 2023

[32] [32]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , url =

Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Hong, Lauren and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and li, dahai and Liu, Zhiyuan and Sun, Maosong , booktitle =. ToolLLM: Facilitating Large Language Models t...

[33] [33]

CRMA rena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments

Huang, Kung-Hsiang and Prabhakar, Akshara and Dhawan, Sidharth and Mao, Yixin and Wang, Huan and Savarese, Silvio and Xiong, Caiming and Laban, Philippe and Wu, Chien-Sheng. CRMA rena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. Proceedings of the 2025 Conference of the Nations of the Americas Chap...

work page doi:10.18653/v1/2025.naacl-long.194 2025

[34] [34]

Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Zhiruo Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Melroy Maben and Raj Mehta and Wayne Chi and Lawrence Keunho Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig , bookti...

2026

[35] [35]

2024 , eprint=

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? , author=. 2024 , eprint=

2024

[36] [36]

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks , url =

Boisvert, L\'. WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks , url =. Advances in Neural Information Processing Systems , doi =

[37] [37]

Kim and Samuel Miserendino and Gildas Chabot and David Li and Patrick Chao and Michael Sharman and Alexandra Barr and Amelia Glaese and Jerry Tworek , booktitle=

Tejal Patwardhan and Rachel Dias and Elizabeth Proehl and Grace Kim and Michele Wang and Olivia Watkins and Simon Posada Fishman and Marwan Aljubeh and Phoebe Thacker and Laurance Fauconnet and Natalie S. Kim and Samuel Miserendino and Gildas Chabot and David Li and Patrick Chao and Michael Sharman and Alexandra Barr and Amelia Glaese and Jerry Tworek , b...

2026

[38] [38]

International Conference on Learning Representations , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=

[39] [39]

Global MMLU : Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

work page doi:10.18653/v1/2025.acl-long.919 2025

[40] [40]

G lot E val: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Luo, Hengyu and Li, Zihao and Attieh, Joseph and Devkota, Sawal and de Gibert, Ona and Huang, Xu and Ji, Shaoxiong and Lin, Peiqin and Mantina, Bhavani Sai Praneeth Varma and Sreenidhi, Ananda and V. G lot E val: A Test Suite for Massively Multilingual Evaluation of Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural ...

work page doi:10.18653/v1/2025.emnlp-demos.43 2025

[41] [41]

2024 , eprint=

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark , author=. 2024 , eprint=

2024

[42] [42]

2025 , howpublished =

2025

[43] [43]

N usa X : Multilingual Parallel Sentiment Dataset for 10 I ndonesian Local Languages

Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian. N usa X : Multilingual Parallel Sentiment Dataset for 10 I ndonesian Loca...

work page doi:10.18653/v1/2023.eacl-main.57 2023

[44] [44]

N usa W rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Cahyawijaya, Samuel and Lovenia, Holy and Koto, Fajri and Adhista, Dea and Dave, Emmanuel and Oktavianti, Sarah and Akbar, Salsabil and Lee, Jhonson and Shadieq, Nuur and Cenggoro, Tjeng Wawan and Linuwih, Hanung and Wilie, Bryan and Muridan, Galih and Winata, Genta and Moeljadi, David and Aji, Alham Fikri and Purwarianti, Ayu and Fung, Pascale. N usa W r...

work page doi:10.18653/v1/2023.ijcnlp-main.60 2023

[45] [45]

N usa D ialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages

Purwarianti, Ayu and Adhista, Dea and Baptiso, Agung and Mahfuzh, Miftahul and Sabila, Yusrina and Adila, Aulia and Cahyawijaya, Samuel and Aji, Alham Fikri. N usa D ialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages. Proceedings of the Second Workshop in South East Asian Language Processing. 2025

2025

[46] [46]

I ndo T o D : A Multi-Domain I ndonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Kautsar, Muhammad and Nurdini, Rahmah and Cahyawijaya, Samuel and Winata, Genta and Purwarianti, Ayu. I ndo T o D : A Multi-Domain I ndonesian Benchmark For End-to-End Task-Oriented Dialogue Systems. Proceedings of the First Workshop in South East Asian Language Processing. 2023. doi:10.18653/v1/2023.sealp-1.7

work page doi:10.18653/v1/2023.sealp-1.7 2023

[47] [47]

and Santoso, Jennifer and Aco, Elyanah and Fadhilah, Akhdan and Mansurov, Jonibek and Imperial, Joseph Marvin and Kampman, Onno P

Lovenia, Holy and Mahendra, Rahmad and Akbar, Salsabil Maulana and Miranda, Lester James V. and Santoso, Jennifer and Aco, Elyanah and Fadhilah, Akhdan and Mansurov, Jonibek and Imperial, Joseph Marvin and Kampman, Onno P. and Moniz, Joel Ruben Antony and Habibi, Muhammad Ravi Shulthan and Hudi, Frederikus and Montalan, Railey and Ignatius, Ryan and Lopo,...

work page doi:10.18653/v1/2024.emnlp-main.296 2024

[48] [48]

Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses

Cahyawijaya, Samuel and Zhang, Ruochen and Cruz, Jan Christian Blaise and Lovenia, Holy and Gilbert, Elisa and Nomoto, Hiroki and Aji, Alham Fikri. Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Senses. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.f...

work page doi:10.18653/v1/2025.findings-naacl.178 2025

[49] [49]

S ea E xam and S ea B ench: Benchmarking LLM s with Local Multilingual Questions in S outheast A sia

Liu, Chaoqun and Zhang, Wenxuan and Ying, Jiahao and Aljunied, Mahani and Luu, Anh Tuan and Bing, Lidong. S ea E xam and S ea B ench: Benchmarking LLM s with Local Multilingual Questions in S outheast A sia. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.341

work page doi:10.18653/v1/2025.findings-naacl.341 2025

[50] [50]

SEA - HELM : S outheast A sian Holistic Evaluation of Language Models

Susanto, Yosephine and Hulagadri, Adithya Venkatadri and Montalan, Jann Railey and Ngui, Jian Gang and Yong, Xianbin and Leong, Wei Qi and Rengarajan, Hamsawardhini and Limkonchotiwat, Peerat and Mai, Yifan and Tjhi, William Chandra. SEA - HELM : S outheast A sian Holistic Evaluation of Language Models. Findings of the Association for Computational Lingui...

work page doi:10.18653/v1/2025.findings-acl.636 2025

[51] [51]

N usa C rowd: Open Source Initiative for I ndonesian NLP Resources

Cahyawijaya, Samuel and Lovenia, Holy and Aji, Alham Fikri and Winata, Genta and Wilie, Bryan and Koto, Fajri and Mahendra, Rahmad and Wibisono, Christian and Romadhony, Ade and Vincentio, Karissa and Santoso, Jennifer and Moeljadi, David and Wirawan, Cahya and Hudi, Frederikus and Wicaksono, Muhammad Satrio and Parmonangan, Ivan and Alfina, Ika and Putra...

work page doi:10.18653/v1/2023.findings-acl.868 2023

[52] [52]

and Han, Kenneth Chen Ko and Santos, Anjela Gail D

Cahyawijaya, Samuel and Lovenia, Holy and Moniz, Joel Ruben Antony and Wong, Tack Hwa and Farhansyah, Mohammad Rifqi and Maung, Thant Thiri and Hudi, Frederikus and Anugraha, David and Habibi, Muhammad Ravi Shulthan and Qorib, Muhammad Reza and Agarwal, Amit and Imperial, Joseph Marvin and Patel, Hitesh Laxmichand and Feliren, Vicky and Nasution, Bahrul I...

work page doi:10.18653/v1/2025.acl-long.916 2025

[53] [53]

2026 , eprint=

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model , author=. 2026 , eprint=

2026

[54] [54]

SEA - VQA : S outheast A sian Cultural Context Dataset For Visual Question Answering

Urailertprasert, Norawit and Limkonchotiwat, Peerat and Suwajanakorn, Supasorn and Nutanong, Sarana. SEA - VQA : S outheast A sian Cultural Context Dataset For Visual Question Answering. Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR). 2024. doi:10.18653/v1/2024.alvr-1.15

work page doi:10.18653/v1/2024.alvr-1.15 2024

[55] [55]

2024 , howpublished =

2024

[56] [56]

2025 , eprint=

^2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. 2025 , eprint=

2025

[57] [57]

2026 , eprint=

-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains , author=. 2026 , eprint=

2026

[58] [58]

Position: The Right to

Rashid Mushkani and Hugo Berard and Allison Cohen and Shin Koseki , booktitle=. Position: The Right to. 2025 , url=

2025

[59] [59]

2026 , url =

Barasa, Hilda and Tay, PeiChin and McBride, Keegan and Iosad, Alexander and M. 2026 , url =

2026

[60] [60]

Sovereignty in the Age of AI: Strategic Choices, Structural Dependencies , month = jan, year =

[61] [61]

2026 , month = apr, day =

Bhandari, Manik and Modi, Gaurav , title =. 2026 , month = apr, day =

2026

[62] [62]

Why Sovereign Artificial Intelligence Is Imperative in Southeast Asia , year =

[63] [63]

Frontiers in Artificial Intelligence , VOLUME=

Putra, Bama Andika , TITLE=. Frontiers in Artificial Intelligence , VOLUME=. 2024 , URL=. doi:10.3389/frai.2024.1411838 , ISSN=

work page doi:10.3389/frai.2024.1411838 2024

[64] [64]

2024 , url =

Wenxuan Zhang and Hou Pong Chan and Yiran Zhao and Mahani Aljunied and Jianyu Wang and Chaoqun Liu and Yue Deng and Zhiqiang Hu and Weiwen Xu and Yew Ken Chia and Xin Li and Lidong Bing , title =. 2024 , url =

2024

[65] [65]

2024 , booktitle =

Xuan-Phi Nguyen and Wenxuan Zhang and Xin Li and Mahani Aljunied and Zhiqiang Hu and Chenhui Shen and Yew Ken Chia and Xingxuan Li and Jianyu Wang, Qingyu Tan and Liying Cheng and Guanzheng Chen and Yue Deng and Sen Yang and Chaoqun Liu and Hang Zhang and Lidong Bing , title =. 2024 , booktitle =

2024

[66] [66]

2026 , eprint=

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations , author=. 2026 , eprint=

2026