Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models
Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.
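To make the parameter value language mismatch concrete, here is a minimal sketch of how such a case might be flagged. It is not the benchmark's method: the tool name `get_weather`, its arguments, and the helper `non_ascii_values` are hypothetical, and the check (any non-ASCII letter in a string argument) is only a rough proxy. It would miss Igbo values written without diacritics, for instance.

```python
def non_ascii_values(arguments: dict) -> dict:
    """Return the string-valued arguments that contain non-ASCII letters.

    Rough proxy for "value was written in the user's language rather than
    the English form the tool expects"; an assumption, not the paper's metric.
    """
    flagged = {}
    for name, value in arguments.items():
        if isinstance(value, str) and any(
            ch.isalpha() and not ch.isascii() for ch in value
        ):
            flagged[name] = value
    return flagged


# Hypothetical tool calls for the same user intent (weather in Beijing).
call_en = {"name": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}
call_zh = {"name": "get_weather", "arguments": {"city": "北京", "unit": "celsius"}}

print(non_ascii_values(call_en["arguments"]))  # {}
print(non_ascii_values(call_zh["arguments"]))  # {'city': '北京'}
```

In both calls the tool selection and the intent are the same; only the `city` value differs, which is the kind of failure the abstract describes as occurring despite correct intent understanding.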
Forward citations
Cited by 3 Pith papers
- GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
  A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating subs...
- What to Format and How: A Benchmark and Workflow Approach for Document Formatting
  Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.
- The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
  The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.