Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Hinrich Schütze; Lena Altinger; Michael A. Hedderich; Raoyuan Zhao; Yihong Liu

arxiv: 2510.09536 · v3 · submitted 2025-10-10 · 💻 cs.CL

Raoyuan Zhao, Yihong Liu, Lena Altinger, Hinrich Schütze, Michael A. Hedderich

Pith reviewed 2026-05-18 07:48 UTC · model grok-4.3

keywords: large language models · robustness evaluation · typographical errors · multilingual NLP · instruction tuning · natural language inference · machine translation · reasoning

The pith

Typographical errors degrade large language model performance across languages, with reasoning tasks most affected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models remain reliable when users introduce typographical errors in different languages. To do this, it develops MulTypo to simulate realistic typos based on keyboard layouts and typing patterns. Across 18 models and five tasks, typos lower performance, hitting generative and reasoning tasks harder than natural language inference. Instruction tuning raises clean performance but can increase sensitivity to noise. Robustness is higher for high-resource languages and for translations out of English.

Core claim

Typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. Robustness is language-dependent: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English.

What carries the argument

MulTypo, the algorithm for generating multilingual typographical errors by simulating errors from language-specific keyboard layouts and typical typing behavior.
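
As a rough illustration of what keyboard-layout-driven typo generation can look like, here is a minimal Python sketch. It is not the MulTypo implementation or its API. The partial adjacency table `QWERTY_NEIGHBORS`, the function `add_typos`, the `rate` parameter, and the uniform choice among four edit operations are all assumptions made for this example; the paper describes language-specific layouts and typing behavior that this toy version does not model.

```python
import random

# Hypothetical, partial QWERTY adjacency map for illustration only.
# A real layout table would cover every key and every target language.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}


def add_typos(text, rate, rng):
    """Corrupt roughly `rate` of the alphabetic characters in `text`.

    Each selected character gets one of four edits: substitution with a
    neighboring key, deletion, insertion of a neighboring key, or
    transposition with the following letter. Case is not preserved for
    substituted or inserted characters in this simplified version.
    """
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        # Leave non-letters alone, and skip letters not selected for noise.
        if not c.isalpha() or rng.random() >= rate:
            out.append(c)
            i += 1
            continue
        op = rng.choice(["substitute", "delete", "insert", "transpose"])
        neighbors = QWERTY_NEIGHBORS.get(c.lower())
        if op == "substitute" and neighbors:
            out.append(rng.choice(neighbors))
        elif op == "delete":
            pass  # drop the character
        elif op == "insert" and neighbors:
            out.extend([c, rng.choice(neighbors)])
        elif op == "transpose" and i + 1 < len(chars) and chars[i + 1].isalpha():
            out.extend([chars[i + 1], c])
            i += 1  # the next character has already been emitted
        else:
            out.append(c)  # edit not applicable here; keep the character
        i += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print(add_typos("translate this sentence into German", 0.1, rng))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which matters when the same corrupted inputs have to be shown to many models.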

If this is right

LLM benchmarks should include typo noise to better reflect real usage.
Special techniques are needed to make reasoning and generation robust to typos.
Low-resource languages require targeted improvements for robustness.
Instruction tuning alone is insufficient for noisy input scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models could be fine-tuned on datasets with simulated typos to enhance resilience.
Real-world applications in customer support or education may need input cleaning or error-tolerant designs.
Extending this to other error types like missing diacritics could reveal broader patterns in multilingual robustness.
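
To make the missing-diacritics extension concrete, the sketch below shows one simple way such noise could be produced with Python's standard `unicodedata` module. The function name `strip_diacritics` is hypothetical, and this is an editorial illustration rather than anything taken from the paper or the MulTypo package.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that e.g. 'ü' becomes 'u'."""
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining diacritics.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Schütze"))  # Schutze
    print(strip_diacritics("café"))     # cafe
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged, so a fuller treatment would need language-specific replacement rules.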

Load-bearing premise

MulTypo-generated typos based on keyboard layouts accurately reflect the kinds of mistakes humans actually make when typing in those languages.

What would settle it

If real human-generated typos in the same languages and tasks produce different performance patterns than MulTypo typos, the simulation's validity would be questioned.

Original abstract

Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs, naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We release a Python package for MulTypo and make the source code publicly available at https://github.com/cisnlp/multypo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MulTypo gives a practical way to test multilingual typo robustness at scale and the results show clear task and language differences in degradation, but the simulation lacks direct checks against real human error data.

MulTypo offers a new way to simulate typos across languages, and the results show consistent performance drops that vary by task and by language resource level. The work stands out for its breadth: it tests 18 open-source models on NLI, MCQA, math reasoning, and machine translation, with attention to high- versus low-resource languages. Releasing the Python package is a plus for anyone wanting to build on this. The paper organizes its findings clearly, highlighting that generative and reasoning tasks suffer more while NLI is more robust, and that instruction tuning may trade away some robustness.

The potential issue is whether the simulated typos reflect real human errors. The method uses language-specific keyboards and typing probabilities, but without a side-by-side comparison to actual typo corpora on metrics like substitution rates or edit distances, it is hard to know whether the brittleness findings would hold for genuine user inputs. That assumption is central and needs more support.

This paper would interest people building or evaluating LLMs for noisy, multilingual interfaces. It brings new data and a tool, so it deserves a serious referee even if revisions are needed on the validation side. Recommendation: send it out for peer review.

Referee Report

1 major / 1 minor

Summary. The paper introduces MulTypo, a multilingual typo generation algorithm simulating human-like errors via language-specific keyboard layouts and probabilistic typing models. It evaluates 18 open-source LLMs from three families on five tasks spanning NLI, MCQA, mathematical reasoning, and MT across languages. It reports consistent performance degradation from typos (stronger in generative and reasoning tasks than in NLI), language-dependent patterns (high-resource languages are more robust), and the observation that instruction tuning boosts clean performance but may heighten noise sensitivity. The work releases a Python package and code.

Significance. If MulTypo errors prove representative of real user inputs, the results highlight practical robustness gaps for LLMs in multilingual deployments and motivate noise-aware training. The public release of the MulTypo implementation and evaluation code is a clear strength for reproducibility.

major comments (1)

[MulTypo description] (likely §3): No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.
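
One way the requested validation could be approached is sketched below in Python: classify each (clean, typo) word pair by edit type, build a frequency distribution for simulated and for human typos, and measure the gap with total variation distance. All names here (`edit_type`, `edit_distribution`, `total_variation`) and the tiny example pairs are hypothetical. The sketch only handles single-edit typos and does not cover substitution matrices, so it is a starting point under those assumptions rather than the analysis the authors would run.

```python
from collections import Counter


def edit_type(clean, typo):
    """Classify a single-edit typo; anything more complex is 'other'."""
    if clean == typo:
        return "none"
    if len(typo) == len(clean):
        diff = [i for i in range(len(clean)) if clean[i] != typo[i]]
        if len(diff) == 1:
            return "substitution"
        # Two adjacent positions whose characters are swapped.
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and clean[diff[0]] == typo[diff[1]]
                and clean[diff[1]] == typo[diff[0]]):
            return "transposition"
        return "other"
    if len(typo) == len(clean) + 1:
        # Removing one character from the typo recovers the clean word.
        if any(typo[:i] + typo[i + 1:] == clean for i in range(len(typo))):
            return "insertion"
    if len(typo) + 1 == len(clean):
        # Removing one character from the clean word yields the typo.
        if any(clean[:i] + clean[i + 1:] == typo for i in range(len(clean))):
            return "deletion"
    return "other"


def edit_distribution(pairs):
    """Relative frequency of each edit type over a non-empty list of pairs."""
    counts = Counter(edit_type(c, t) for c, t in pairs)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    # Made-up pairs, purely to show the call pattern.
    simulated = [("cat", "cst"), ("cat", "ct")]
    human = [("cat", "cst"), ("dog", "dig")]
    print(total_variation(edit_distribution(simulated), edit_distribution(human)))
```

A distance near 0 would suggest the simulated edit mix resembles the human one; a distance near 1 would suggest the noise model emphasizes different error types.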

minor comments (1)

[Abstract]: The statement on instruction tuning increasing brittleness would be strengthened by reporting specific delta values or statistical tests rather than qualitative description.
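
For concreteness, the kind of delta reporting suggested here could look like the following Python sketch. It computes the accuracy drop between clean and typo-corrupted inputs for one model and attaches a paired bootstrap confidence interval. The function names, the per-example 0/1 correctness lists, and the default of 2000 resamples are assumptions for illustration; the paper may use different metrics or tests.

```python
import random
from statistics import mean


def accuracy_drop(clean, noisy):
    """Clean accuracy minus noisy accuracy, from per-example 0/1 scores."""
    return mean(clean) - mean(noisy)


def bootstrap_drop_ci(clean, noisy, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the accuracy drop.

    Examples are resampled jointly, so each draw keeps the clean and
    noisy score of the same underlying example together (paired design).
    """
    assert len(clean) == len(noisy), "scores must be aligned per example"
    rng = random.Random(seed)
    n = len(clean)
    drops = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(mean([clean[i] for i in idx]) - mean([noisy[i] for i in idx]))
    drops.sort()
    lower = drops[int((alpha / 2) * n_resamples)]
    upper = drops[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up correctness vectors, purely to show the call pattern.
    clean_scores = [1, 1, 1, 0, 1, 1, 0, 1]
    noisy_scores = [1, 0, 1, 0, 1, 0, 0, 1]
    print(accuracy_drop(clean_scores, noisy_scores))
    print(bootstrap_drop_ci(clean_scores, noisy_scores))
```

Running the same computation for a base model and its instruction-tuned counterpart on identical examples would give two drops whose intervals can be compared directly.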

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment below and outline planned revisions.

Referee: [MulTypo description] (likely §3): No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.

Authors: We acknowledge that a direct quantitative comparison to human error corpora would strengthen the ecological validity of MulTypo. The algorithm is constructed from language-specific keyboard layouts and probabilistic models drawn from established studies of typing behavior. Comprehensive, publicly available human typo corpora with edit-type frequencies and substitution matrices do not exist for all five languages in our evaluation, particularly the lower-resource ones. We will revise Section 3 to (i) provide a more explicit derivation of the probabilistic parameters with citations to prior linguistic work, (ii) include a comparison against available English human-error datasets where possible, and (iii) add an explicit limitations paragraph discussing the absence of such corpora for the remaining languages. These changes will be incorporated in the revised manuscript.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

This paper conducts an empirical evaluation by defining the MulTypo algorithm to simulate typos and then measuring observed performance drops on LLMs across tasks and languages. The central claims rest on direct experimental results rather than any derivation, fitted parameter, or self-referential definition that reduces the outcome to the inputs by construction. No mathematical models, predictions, or load-bearing self-citations are present; the methodology is self-contained and externally falsifiable via replication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that keyboard-layout-based error simulation captures human typing behavior; no free parameters or invented physical entities are introduced, but the method itself embodies domain assumptions about typing patterns.

axioms (1)

Domain assumption: Language-specific keyboard layouts and typical typing behavior can be used to generate representative typographical errors.
Invoked in the description of the MulTypo algorithm in the abstract.

pith-pipeline@v0.9.0 · 5755 in / 1263 out tokens · 25547 ms · 2026-05-18T07:48:03.948978+00:00 · methodology
