Understanding the Ability of LLMs to Handle Character-Level Perturbation

Anyuan Zhuo; Jingyi Zhu; Ningyuan Li; Pinyan Lu; Xuefei Ning; Yu Wang

arxiv: 2510.14365 · v4 · pith:LGJAHYC7new · submitted 2025-10-16 · 💻 cs.CL

Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang, Pinyan Lu

classification 💻 cs.CL

keywords: llms, character-level, characters, perturbation, perturbations, text, examine, including

This work investigates the resilience of contemporary large language models (LLMs) against frequent character-level perturbations. We examine three types of character-level perturbations including introducing numerous typos within words, shuffling the characters in each word, and inserting a large number of invisible characters into the text. Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as noise, many LLMs still maintain notable performance. We explore the underlying causes of this robustness and find that LLMs exhibit remarkable resilience to chaotic segmentation and fragmented tokenization. Furthermore, we examine the mechanisms by which LLMs remove perturbations to correctly comprehend text, including both implicit and explicit mechanisms for character-level perturbation. We hope that our findings on the low-level robustness of LLMs will unveil their inherent architectural strengths, reveal the potential risks of their misuse, and inform the reliable deployment of LLMs across diverse application scenarios.
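
The three perturbation types named in the abstract (typos within words, character shuffling within each word, and insertion of invisible characters) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names, the choice of zero-width code points, and the `rate` and `per_char` parameters are all assumptions made for illustration, and the paper's actual perturbation settings may differ.

```python
import random

# Assumed set of invisible characters: zero-width space, non-joiner, joiner.
# The paper may use a different set; this choice is illustrative only.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def add_typos(text: str, rate: float, rng: random.Random) -> str:
    """Replace each alphabetic character with a random lowercase letter
    with probability `rate` (a hypothetical typo model)."""
    return "".join(
        rng.choice(LETTERS) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )


def shuffle_words(text: str, rng: random.Random) -> str:
    """Shuffle the characters inside every space-separated word,
    keeping the word order and the spaces unchanged."""
    shuffled = []
    for word in text.split(" "):
        chars = list(word)
        rng.shuffle(chars)
        shuffled.append("".join(chars))
    return " ".join(shuffled)


def insert_invisible(text: str, per_char: int, rng: random.Random) -> str:
    """Append `per_char` randomly chosen invisible characters after
    every visible character."""
    out = []
    for ch in text:
        out.append(ch)
        out.extend(rng.choice(ZERO_WIDTH) for _ in range(per_char))
    return "".join(out)


def strip_invisible(text: str) -> str:
    """Explicit clean-up step: drop the assumed invisible characters."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = "large language models handle noisy text"
    print(add_typos(sample, 0.3, rng))
    print(shuffle_words(sample, rng))
    noisy = insert_invisible(sample, 3, rng)
    print(len(sample), len(noisy))  # invisible characters outnumber visible ones
```

With `per_char=3`, the invisible characters are three times as numerous as the visible ones, which loosely mirrors the "several times more than the visible ones" setting described in the abstract. `strip_invisible` only illustrates an explicit removal step; it says nothing about how a model handles the noise internally.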

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
cs.CL 2026-05 unverdicted novelty 7.0

LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.