arxiv: 2510.09536 · v3 · submitted 2025-10-10 · 💻 cs.CL

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Pith reviewed 2026-05-18 07:48 UTC · model grok-4.3

keywords: large language models · robustness evaluation · typographical errors · multilingual NLP · instruction tuning · natural language inference · machine translation · reasoning

The pith

Typographical errors degrade large language model performance across languages, with reasoning tasks most affected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models remain reliable when users introduce typographical errors in different languages. To do this, it develops MulTypo to simulate realistic typos based on keyboard layouts and typing patterns. Across 18 models and five tasks, typos lower performance, hitting generative and reasoning tasks harder than natural language inference. Instruction tuning raises clean performance but can increase sensitivity to noise. Robustness is higher for high-resource languages and for translations out of English.

Core claim

Typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. Robustness is language-dependent: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English.

What carries the argument

MulTypo, the algorithm for generating multilingual typographical errors by simulating errors from language-specific keyboard layouts and typical typing behavior.
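To make the idea of keyboard-based typo simulation concrete, here is a minimal sketch. It is not MulTypo's implementation or API: the function name `add_typos`, the partial `QWERTY_NEIGHBORS` map, the four edit types, and the uniform choice among them are all assumptions made for illustration.

```python
import random

# Hypothetical, partial QWERTY neighbor map (illustrative only;
# not taken from MulTypo's layout data).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "t": "rfgy", "n": "bhjm", "i": "ujko",
}

def add_typos(text, rate, seed=0):
    """Corrupt each letter with probability `rate` using one random edit.

    Edit types (an assumption of this sketch): substitute with a
    neighboring key, delete, insert a neighboring key, or transpose
    with the next letter. If the chosen edit is not applicable, the
    character is kept unchanged.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(["substitute", "delete", "insert", "transpose"])
            neighbors = QWERTY_NEIGHBORS.get(ch.lower())
            if op == "substitute" and neighbors:
                out.append(rng.choice(neighbors))
            elif op == "delete":
                pass  # drop the character
            elif op == "insert" and neighbors:
                out.extend([ch, rng.choice(neighbors)])
            elif (op == "transpose" and i + 1 < len(chars)
                  and chars[i + 1].isalpha()):
                out.extend([chars[i + 1], ch])
                i += 1  # the next character was consumed by the swap
            else:
                out.append(ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)

noisy = add_typos("translate this sentence", rate=0.2, seed=1)
```

A language-specific variant would swap in a neighbor map for the relevant keyboard layout. The paper's algorithm additionally models typing behavior, which this sketch does not attempt.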

If this is right

  • LLM benchmarks should include typo noise to better reflect real usage.
  • Special techniques are needed to make reasoning and generation robust to typos.
  • Low-resource languages require targeted improvements for robustness.
  • Instruction tuning alone is insufficient for noisy input scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could be fine-tuned on datasets with simulated typos to enhance resilience.
  • Real-world applications in customer support or education may need input cleaning or error-tolerant designs.
  • Extending this to other error types like missing diacritics could reveal broader patterns in multilingual robustness.

Load-bearing premise

MulTypo-generated typos based on keyboard layouts accurately reflect the kinds of mistakes humans actually make when typing in those languages.

What would settle it

If real human-generated typos in the same languages and tasks produce different performance patterns than MulTypo typos, the simulation's validity would be questioned.
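One way such a comparison could be set up is sketched below: compute the relative performance drop per task under each noise source, then check whether the tasks rank the same way. The helper names are hypothetical and the scores are placeholders chosen for illustration, not results from the paper.

```python
def relative_drop(clean_score, noisy_score):
    """Fraction of clean performance lost under noise."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - noisy_score) / clean_score

def degradation_ranking(clean, noisy):
    """Return task names sorted from most to least degraded."""
    drops = {task: relative_drop(clean[task], noisy[task]) for task in clean}
    return sorted(drops, key=drops.get, reverse=True)

# Placeholder scores, NOT results from the paper.
clean = {"task_a": 0.80, "task_b": 0.60}
simulated = {"task_a": 0.72, "task_b": 0.45}
human = {"task_a": 0.70, "task_b": 0.48}

# True when both noise sources order the tasks identically.
same_pattern = (degradation_ranking(clean, simulated)
                == degradation_ranking(clean, human))
```

Agreement in ranking is only a coarse signal; a fuller analysis would also compare the size of the drops and their variance across models and languages.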

Original abstract

Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs, naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We release a Python package for MulTypo and make the source code publicly available at https://github.com/cisnlp/multypo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MulTypo, a multilingual typo generation algorithm that simulates human-like errors via language-specific keyboard layouts and probabilistic typing models. It evaluates 18 open-source LLMs from three families on five tasks spanning NLI, MCQA, mathematical reasoning, and MT across languages. Typos consistently degrade performance, more strongly in generative and reasoning tasks than in NLI. Robustness is language-dependent, with high-resource languages more robust, and instruction tuning boosts clean performance but may heighten noise sensitivity. The work releases a Python package and code.

Significance. If MulTypo errors prove representative of real user inputs, the results highlight practical robustness gaps for LLMs in multilingual deployments and motivate noise-aware training. The public release of the MulTypo implementation and evaluation code is a clear strength for reproducibility.

major comments (1)
  1. [MulTypo description, likely §3] No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.
minor comments (1)
  1. [Abstract] The statement on instruction tuning increasing brittleness would be strengthened by reporting specific delta values or statistical tests rather than qualitative description.
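The validation requested in the major comment could be approached roughly as follows: label each (intended, typed) word pair with its edit type, build a frequency distribution for simulated and for human typos, and measure how far apart the two distributions are. This is a sketch under stated assumptions. The function names, the restriction to single-edit errors, and the use of total variation distance are illustrative choices, not something the paper or the referee specifies.

```python
from collections import Counter

def classify_edit(intended, typed):
    """Label a word pair with a single-edit type, 'none', or 'other'."""
    if intended == typed:
        return "none"
    n, m = len(intended), len(typed)
    if n == m:
        diff = [i for i in range(n) if intended[i] != typed[i]]
        if len(diff) == 1:
            return "substitution"
        if (len(diff) == 2 and diff[1] == diff[0] + 1
                and intended[diff[0]] == typed[diff[1]]
                and intended[diff[1]] == typed[diff[0]]):
            return "transposition"
    elif m == n + 1:
        # Removing one character from the typed word recovers the intended one.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(m)):
            return "insertion"
    elif m == n - 1:
        # Removing one character from the intended word gives the typed one.
        if any(intended[:i] + intended[i + 1:] == typed for i in range(n)):
            return "deletion"
    return "other"

def edit_type_distribution(pairs):
    """Relative frequency of each edit type, ignoring unchanged pairs."""
    counts = Counter(classify_edit(a, b) for a, b in pairs)
    counts.pop("none", None)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A distance near 0 would indicate that the simulated edit-type mix resembles the human one; substitution matrices would need a separate, character-level comparison.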

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment below and outline planned revisions.

  1. Referee: [MulTypo description, likely §3] No quantitative validation is provided comparing the generated typo distributions (edit-type frequencies, substitution matrices) to empirical human error corpora for the five languages. This is load-bearing for the central claim that observed degradations demonstrate real-world brittleness rather than artifacts of the chosen noise model.

    Authors: We acknowledge that a direct quantitative comparison to human error corpora would strengthen the ecological validity of MulTypo. The algorithm is constructed from language-specific keyboard layouts and probabilistic models drawn from established studies of typing behavior. Comprehensive, publicly available human typo corpora with edit-type frequencies and substitution matrices do not exist for all five languages in our evaluation, particularly the lower-resource ones. We will revise Section 3 to (i) provide a more explicit derivation of the probabilistic parameters with citations to prior linguistic work, (ii) include a comparison against available English human-error datasets where possible, and (iii) add an explicit limitations paragraph discussing the absence of such corpora for the remaining languages. These changes will be incorporated in the revised manuscript.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

This paper conducts an empirical evaluation by defining the MulTypo algorithm to simulate typos and then measuring observed performance drops on LLMs across tasks and languages. The central claims rest on direct experimental results rather than any derivation, fitted parameter, or self-referential definition that reduces the outcome to the inputs by construction. No mathematical models, predictions, or load-bearing self-citations are present; the methodology is self-contained and externally falsifiable via replication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that keyboard-layout-based error simulation captures human typing behavior; no free parameters or invented physical entities are introduced, but the method itself embodies domain assumptions about typing patterns.

axioms (1)
  • domain assumption: Language-specific keyboard layouts and typical typing behavior can be used to generate representative typographical errors
    Invoked in the description of the MulTypo algorithm in the abstract

pith-pipeline@v0.9.0 · 5755 in / 1263 out tokens · 25547 ms · 2026-05-18T07:48:03.948978+00:00 · methodology