The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
Pith reviewed 2026-05-15 18:53 UTC · model grok-4.3
The pith
Large language models recover original meaning from English sentences where content words are replaced by random nonsense strings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models have an astonishing ability to recover meaning from severely degraded English texts in which content words have been randomly substituted by nonsense strings, translating them to conventional English that is in many cases close to the original. These results demonstrate that structural cues such as morphosyntax and closed-class words constrain lexical meaning to a much larger degree than previously imagined, and suggest that efficient language processing, in biological as well as artificial systems, likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
What carries the argument
Morphosyntactic structure and closed-class words that constrain and recover lexical meanings from nonsense-substituted input.
If this is right
- Structural cues alone can support recovery of lexical meaning far beyond what isolated word knowledge would allow.
- Language processing systems gain efficiency from tight coupling of syntax, semantics, and world knowledge.
- Abilities shown on Jabberwockified English are relevant to understanding linguistic structure in both artificial and biological systems.
- Robustness to degraded input becomes a measurable property that can be improved by strengthening structural integration.
Where Pith is reading between the lines
- The same structural constraints may explain human success at understanding noisy speech or heavily accented language.
- Designing future models with explicit emphasis on closed-class scaffolding could increase robustness to real-world noise without more data.
- This work supplies a new benchmark for testing whether language models truly separate syntactic knowledge from memorized lexical patterns.
Load-bearing premise
The models succeed mainly by using grammar and function words to guide meaning rather than by matching patterns seen in their training data or responding to specific prompting tricks.
What would settle it
Test the same models on new Jabberwockified texts that use entirely novel nonsense strings never present in training data and measure whether translation quality drops to near zero.
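Measuring whether quality "drops to near zero" needs a per-sentence score comparing each reconstruction with the original English. The abstract does not say which metric was used, so the following is only a deliberately crude stand-in: a bag-of-words F1 over lowercase word tokens. The function name `unigram_f1` is hypothetical.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Bag-of-words F1 between a reconstruction and the original sentence.

    A crude lexical proxy only: it ignores word order and synonymy,
    which embedding-based metrics are designed to capture.
    """
    # Lowercase word tokens; punctuation is dropped.
    cand = Counter(re.findall(r"[a-z']+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z']+", reference.lower()))
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because it rewards only exact word matches, a reconstruction that uses "meet" where the original had "introduced to" scores zero on that span, so this proxy would understate semantic closeness.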
Original abstract
We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large language models can recover near-original meaning from severely degraded English texts in which content words have been randomly substituted by nonsense strings (e.g., 'At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp' translated to 'At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife'). It argues that this demonstrates structural cues such as morphosyntax and closed-class words constrain lexical meaning to a much larger degree than previously thought, with the abilities being superhuman yet relevant to understanding linguistic structure and efficient processing in biological or artificial systems.
Significance. If the central claim holds under rigorous evaluation, the result would be significant for computational linguistics by illustrating the tight integration of syntax, semantics, and world knowledge in LLMs. It could inform theories of language processing efficiency and suggest new directions for modeling how structural constraints guide interpretation, extending beyond standard benchmarks to highlight emergent capabilities in handling degraded input.
major comments (2)
- [Abstract] Abstract: The central claim that LLMs translate Jabberwockified texts 'in many cases close to the original' is supported solely by illustrative examples. No quantitative metrics (e.g., accuracy rates, semantic similarity scores), sample sizes, baseline comparisons, or statistical tests are reported, leaving the generality and reliability of the 'astonishing ability' without sufficient documented support for rigorous evaluation.
- [Abstract] Abstract: Potential confounds such as training-data overlap, prompting artifacts, or the specific distribution of nonsense substitutions are not addressed or controlled for. This is load-bearing for the claim that success stems primarily from leveraging morphosyntactic structure rather than alternative explanations.
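The request for baseline comparisons and statistical tests can be made concrete once each evaluation sentence has one score for the model and one for a baseline (for example, a fidelity rating). A minimal sketch of one standard option, a paired percentile bootstrap over per-sentence score differences, follows; the function name and default parameters are hypothetical, not taken from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(model_scores, baseline_scores, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sentence difference
    (model minus baseline).

    Sentences are resampled with replacement; each model/baseline pair
    stays together because we resample the differences themselves.
    """
    if len(model_scores) != len(baseline_scores) or not model_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    rng = random.Random(seed)
    n = len(diffs)
    # Distribution of the mean difference under resampling.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), (boot_means[lo_idx], boot_means[hi_idx])
```

Resampling sentences rather than the two score lists independently is what makes the interval paired: it accounts for the fact that some sentences are simply easier for every system.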
minor comments (1)
- [Abstract] Abstract: The method used to generate the random nonsense substitutions (e.g., criteria for selecting replacement strings or ensuring randomness) is not described, which would aid replicability.
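Since the substitution procedure is not described, here is one purely illustrative way it could work. The sketch assumes that content words are identified by exclusion from a small hypothetical closed-class list, that inflectional endings such as -s and -ed are kept on the nonsense stem (as forms like "sproles" and "haiveed" in the abstract's example appear to do), and that nonsense stems are built from hypothetical onset/vowel/coda inventories. None of these names or lists come from the paper.

```python
import random
import re

# Hypothetical closed-class inventory; a real study would need a fuller list.
CLOSED_CLASS = {
    "a", "an", "the", "at", "of", "to", "in", "into", "on", "with", "by",
    "for", "from", "and", "or", "but", "who", "which", "that", "we", "he",
    "she", "it", "they", "his", "her", "their", "our", "is", "are", "was",
    "were", "be", "been",
}
# Hypothetical inflectional endings carried over onto the nonsense stem.
SUFFIXES = ("ing", "ed", "es", "s")
# Hypothetical sound inventories for pronounceable nonsense stems.
ONSETS = ["b", "cr", "gh", "gw", "fl", "h", "ph", "spr", "sw", "wr"]
VOWELS = ["a", "ai", "e", "i", "o", "ou", "u", "ui"]
CODAS = ["b", "ch", "gg", "ll", "nt", "rp", "rr", "v"]


def nonsense_stem(rng: random.Random) -> str:
    """Build a pronounceable onset + vowel + coda nonsense string."""
    return rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)


def split_suffix(word: str) -> tuple[str, str]:
    """Return (stem, suffix) with a naive first-match suffix strip."""
    lower = word.lower()
    for suf in SUFFIXES:
        if lower.endswith(suf) and len(lower) > len(suf) + 2:
            return word[: -len(suf)], word[-len(suf):]
    return word, ""


def jabberwockify(text: str, seed: int = 0) -> str:
    """Replace content words with nonsense stems, keeping closed-class
    words, inflectional endings, capitalization, and punctuation."""
    rng = random.Random(seed)
    out = []
    # Alternating runs of letters and non-letters; joining them restores text.
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if not tok.isalpha() or tok.lower() in CLOSED_CLASS:
            out.append(tok)
            continue
        _, suffix = split_suffix(tok)
        new = nonsense_stem(rng) + suffix.lower()
        if tok[0].isupper():
            new = new.capitalize()
        out.append(new)
    return "".join(out)
```

For instance, `jabberwockify("At the start of the story, we meet a man.")` keeps "At the", "of the", "we", and "a" in place while every other word becomes a nonsense string. Note that "an" before a replaced word is kept as-is, which matches the abstract's "an ghitch flount".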
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that the current manuscript relies primarily on illustrative examples and lacks quantitative support, which limits rigorous evaluation of the claims. We will revise the paper substantially to include systematic experiments, metrics, baselines, and controls for confounds, while preserving the core observation about structural constraints in LLMs.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that LLMs translate Jabberwockified texts 'in many cases close to the original' is supported solely by illustrative examples. No quantitative metrics (e.g., accuracy rates, semantic similarity scores), sample sizes, baseline comparisons, or statistical tests are reported, leaving the generality and reliability of the 'astonishing ability' without sufficient documented support for rigorous evaluation.
Authors: We acknowledge that the present version presents the phenomenon through selected examples without accompanying quantitative evaluation. In the revised manuscript we will add a dedicated evaluation section reporting results on a held-out set of 200 Jabberwockified sentences drawn from diverse sources (news, fiction, technical text). We will report (i) human-rated semantic fidelity on a 1–5 scale with inter-annotator agreement, (ii) automatic metrics including BERTScore and sentence-level BLEURT against the original English, and (iii) comparison against two baselines: a bag-of-words reconstruction model and a syntax-ablated control that randomizes closed-class items. Statistical significance will be assessed via paired t-tests and bootstrap confidence intervals. Sample size and selection criteria will be fully documented. revision: yes
- Referee: [Abstract] Abstract: Potential confounds such as training-data overlap, prompting artifacts, or the specific distribution of nonsense substitutions are not addressed or controlled for. This is load-bearing for the claim that success stems primarily from leveraging morphosyntactic structure rather than alternative explanations.
Authors: We agree these confounds must be ruled out. The revision will include three new control experiments: (1) substitution of content words with strings drawn from a held-out vocabulary never seen in pre-training data (verified via tokenizer inspection); (2) systematic variation of prompt phrasing (zero-shot, few-shot, chain-of-thought) with performance stability reported; (3) an ablation that preserves lexical items but disrupts morphosyntax (e.g., random word-order scrambling within clauses). We will also report the exact substitution procedure (uniform sampling over a fixed nonsense lexicon) and test robustness across different nonsense-generation distributions. These controls will be presented alongside the main results to isolate the contribution of structural cues. revision: yes
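Control (3) above, random word-order scrambling within clauses, could be implemented along the following lines. This is a sketch under a loudly simplifying assumption: clause boundaries are approximated by the punctuation marks , ; : . ! ? and the function name is hypothetical. Lexical items are preserved exactly; only their order inside each clause changes.

```python
import random
import re


def scramble_within_clauses(text, seed=0):
    """Shuffle word order inside each clause while keeping the same words.

    Clauses are (hypothetically) approximated as spans between the
    punctuation marks , ; : . ! ? and the punctuation stays in place.
    """
    rng = random.Random(seed)
    # The capturing group keeps the delimiters as separate list items.
    parts = re.split(r"([,;:.!?])", text)
    out = []
    for part in parts:
        if re.fullmatch(r"[,;:.!?]", part):
            out.append(part)
            continue
        words = part.split()
        rng.shuffle(words)
        # Keep one leading space if the clause began with whitespace.
        prefix = " " if part[:1].isspace() else ""
        out.append(prefix + " ".join(words))
    return "".join(out)
```

Scrambling within rather than across clauses keeps each clause's vocabulary local, so any drop in reconstruction quality can more plausibly be attributed to lost morphosyntactic order than to words drifting into unrelated contexts.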
Circularity Check
No significant circularity
The paper is an empirical demonstration reporting LLM performance on constructed Jabberwockified inputs. It contains no mathematical derivations, equations, fitted parameters, or self-referential definitions. The central claim follows directly from the observed translations without any reduction to inputs by construction, self-citation chains, or smuggled ansatzes. The abstract and framing rely on direct evidence rather than on any load-bearing logical step that collapses into its own premises.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models trained on broad text corpora can leverage contextual, syntactic, and world-knowledge cues to infer missing lexical items.