Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
Pith reviewed 2026-05-18 07:48 UTC · model grok-4.3
The pith
Typographical errors degrade large language model performance across languages, with reasoning tasks most affected.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. Robustness is also language-dependent: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English.
What carries the argument
MulTypo, the algorithm for generating multilingual typographical errors by simulating errors from language-specific keyboard layouts and typical typing behavior.
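To make the idea of keyboard-based typo simulation concrete, here is a minimal Python sketch. It is not the MulTypo implementation or its package API: the adjacency map `QWERTY_NEIGHBORS`, the functions `add_typo` and `corrupt`, the four edit operations, and the uniform sampling are all assumptions chosen for illustration. A language-specific variant would swap in the adjacency map for that language's keyboard layout.

```python
import random

# Hypothetical, partial QWERTY adjacency map (illustration only, not MulTypo's data).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}

def add_typo(word, rng, neighbors=QWERTY_NEIGHBORS):
    """Apply one random edit (substitute, insert, delete, transpose) to a word."""
    if len(word) < 2:
        return word  # leave very short words untouched
    i = rng.randrange(len(word))
    op = rng.choice(["substitute", "insert", "delete", "transpose"])
    ch = word[i].lower()
    if op in ("substitute", "insert") and ch not in neighbors:
        op = "transpose"  # no adjacency data for this key: fall back
    if op == "substitute":
        # Replace the character with a neighboring key.
        return word[:i] + rng.choice(neighbors[ch]) + word[i + 1:]
    if op == "insert":
        # Hit a neighboring key in addition to the intended one.
        return word[:i + 1] + rng.choice(neighbors[ch]) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Transpose: swap the character with its right neighbor (left if it is last).
    j = i if i < len(word) - 1 else i - 1
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]

def corrupt(text, typo_rate, seed=0):
    """Corrupt each whitespace-separated word with probability typo_rate."""
    rng = random.Random(seed)
    return " ".join(
        add_typo(w, rng) if rng.random() < typo_rate else w
        for w in text.split()
    )

print(corrupt("the rain in spain stays mainly in the plain", typo_rate=0.3, seed=1))
```

Fixing the seed keeps corrupted benchmarks reproducible, which matters when the same noisy inputs must be shown to many models.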
If this is right
- LLM benchmarks should include typo noise to better reflect real usage.
- Special techniques are needed to make reasoning and generation robust to typos.
- Low-resource languages require targeted improvements for robustness.
- Instruction tuning alone is insufficient for noisy input scenarios.
Where Pith is reading between the lines
- Models could be fine-tuned on datasets with simulated typos to enhance resilience.
- Real-world applications in customer support or education may need input cleaning or error-tolerant designs.
- Extending this to other error types like missing diacritics could reveal broader patterns in multilingual robustness.
Load-bearing premise
MulTypo-generated typos based on keyboard layouts accurately reflect the kinds of mistakes humans actually make when typing in those languages.
What would settle it
If real human-generated typos in the same languages and tasks produce different performance patterns than MulTypo typos, the simulation's validity would be questioned.
Original abstract
Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs, which naturally introduce typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We release a Python package for MulTypo and make the source code publicly available at https://github.com/cisnlp/multypo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MulTypo, a multilingual typo generation algorithm that simulates human-like errors via language-specific keyboard layouts and probabilistic typing models. It evaluates 18 open-source LLMs from three families on five tasks (NLI, MCQA, mathematical reasoning, MT) across languages. It reports consistent performance degradation from typos (stronger in generative and reasoning tasks than in NLI), language-dependent patterns (high-resource languages are more robust), and the observation that instruction tuning boosts clean performance but may heighten noise sensitivity. The work releases a Python package and code.
Significance. If MulTypo errors prove representative of real user inputs, the results highlight practical robustness gaps for LLMs in multilingual deployments and motivate noise-aware training. The public release of the MulTypo implementation and evaluation code is a clear strength for reproducibility.
major comments (1)
- [MulTypo description] MulTypo description (likely §3): No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.
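One way the validation requested above could be set up is sketched below in Python. This is an assumption about how such a check might look, not an analysis from the paper: the helper names (`edit_type`, `edit_distribution`, `total_variation`) are hypothetical, the word pairs are toy placeholders rather than real corpus data, and total variation distance is only one of several reasonable ways to compare edit-type frequencies.

```python
from collections import Counter

EDIT_TYPES = ("substitute", "delete", "insert", "transpose", "other")

def edit_type(clean, noisy):
    """Classify a single-edit (clean, noisy) word pair; anything else is 'other'."""
    if len(clean) == len(noisy):
        diff = [i for i, (a, b) in enumerate(zip(clean, noisy)) if a != b]
        if len(diff) == 1:
            return "substitute"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and clean[diff[0]] == noisy[diff[1]]
                and clean[diff[1]] == noisy[diff[0]]):
            return "transpose"  # two adjacent characters swapped
    elif len(clean) == len(noisy) + 1:
        # Removing one character from the clean word yields the noisy word.
        if any(clean[:i] + clean[i + 1:] == noisy for i in range(len(clean))):
            return "delete"
    elif len(noisy) == len(clean) + 1:
        # Removing one character from the noisy word yields the clean word.
        if any(noisy[:i] + noisy[i + 1:] == clean for i in range(len(noisy))):
            return "insert"
    return "other"

def edit_distribution(pairs):
    """Relative frequency of each edit type over a non-empty list of pairs."""
    counts = Counter(edit_type(c, n) for c, n in pairs)
    total = sum(counts.values())
    return {t: counts[t] / total for t in EDIT_TYPES}

def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint."""
    return 0.5 * sum(abs(p[t] - q[t]) for t in EDIT_TYPES)

# Toy placeholder pairs (NOT real data): simulated vs. human-produced typos.
simulated = [("hello", "hrllo"), ("hello", "helo")]
human = [("hello", "hrllo"), ("hello", "hlelo")]
print(total_variation(edit_distribution(simulated), edit_distribution(human)))
```

A small distance for a given language would support the claim that the simulated noise resembles human typing errors at the level of edit types; it would not by itself validate which characters get substituted, which would need a comparison of substitution matrices.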
minor comments (1)
- [Abstract] Abstract: The statement on instruction tuning increasing brittleness would be strengthened by reporting specific delta values or statistical tests rather than qualitative description.
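The kind of quantitative support this comment asks for could look like the following Python sketch. It is an illustration under stated assumptions, not the paper's analysis: it assumes per-example 0/1 correctness lists for a base model and its instruction-tuned counterpart on the same examples, and the function names, the percentile bootstrap, and the toy inputs are all hypothetical choices.

```python
import random

def accuracy_drop(clean, noisy):
    """Clean accuracy minus noisy accuracy, from per-example 0/1 correctness."""
    return (sum(clean) - sum(noisy)) / len(clean)

def bootstrap_delta_ci(base_clean, base_noisy, tuned_clean, tuned_noisy,
                       n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for (tuned drop - base drop).

    All four lists must cover the same examples in the same order, so that
    each resampled index picks the same example for both models.
    """
    n = len(base_clean)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        d_base = accuracy_drop([base_clean[i] for i in idx],
                               [base_noisy[i] for i in idx])
        d_tuned = accuracy_drop([tuned_clean[i] for i in idx],
                                [tuned_noisy[i] for i in idx])
        deltas.append(d_tuned - d_base)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy placeholder correctness vectors (NOT results from the paper).
base_clean, base_noisy = [1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1]
tuned_clean, tuned_noisy = [1, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0]
print(bootstrap_delta_ci(base_clean, base_noisy, tuned_clean, tuned_noisy))
```

An interval that lies entirely above zero would indicate that the instruction-tuned model loses more accuracy under typos than the base model on that task and language; an interval that straddles zero would mean the data do not separate the two.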
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment below and outline planned revisions.
Point-by-point responses
- Referee: [MulTypo description] MulTypo description (likely §3): No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.
  Authors: We acknowledge that a direct quantitative comparison to human error corpora would strengthen the ecological validity of MulTypo. The algorithm is constructed from language-specific keyboard layouts and probabilistic models drawn from established studies of typing behavior. Comprehensive, publicly available human typo corpora with edit-type frequencies and substitution matrices do not exist for all five languages in our evaluation, particularly the lower-resource ones. We will revise Section 3 to (i) provide a more explicit derivation of the probabilistic parameters with citations to prior linguistic work, (ii) include a comparison against available English human-error datasets where possible, and (iii) add an explicit limitations paragraph discussing the absence of such corpora for the remaining languages. These changes will be incorporated in the revised manuscript.
  Revision: partial
Circularity Check
No significant circularity in empirical evaluation study
This paper conducts an empirical evaluation by defining the MulTypo algorithm to simulate typos and then measuring observed performance drops on LLMs across tasks and languages. The central claims rest on direct experimental results rather than any derivation, fitted parameter, or self-referential definition that reduces the outcome to the inputs by construction. No mathematical models, predictions, or load-bearing self-citations are present; the methodology is self-contained and externally falsifiable via replication.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language-specific keyboard layouts and typical typing behavior can be used to generate representative typographical errors.