The Astonishing Ability of Large Language Models to Parse Jabberwockified Language

Gary Lupyan; Senyi Yang

arxiv: 2602.23928 · v2 · submitted 2026-02-27 · 💻 cs.CL

Pith reviewed 2026-05-15 18:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords large language models · jabberwocky · degraded text · lexical recovery · morphosyntax · semantic parsing · linguistic structure

The pith

Large language models can often recover meaning close to the original from English sentences whose content words have been replaced by random nonsense strings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs translate severely degraded texts—where most content words are swapped for invented nonsense—back into sensible English that often matches the original. This works because the remaining grammar, word order, and small function words like prepositions and articles provide strong constraints on what the nonsense must mean. A sympathetic reader would see this as evidence that language understanding depends far more on structural scaffolding than on knowing every individual word in advance. The finding matters because it reveals how tightly syntax, semantics, and background knowledge must work together for efficient comprehension in any system. It also points to new ways of testing how robust language processing can be when input is noisy or incomplete.

Core claim

Large language models have an astonishing ability to recover meaning from severely degraded English texts in which content words have been randomly substituted by nonsense strings, translating them to conventional English that is in many cases close to the original. The authors take these results to show that structural cues such as morphosyntax and closed-class words constrain lexical meaning to a much larger degree than previously imagined, and to suggest that efficient language processing, in biological or artificial systems, likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.

What carries the argument

Morphosyntactic structure and closed-class words that constrain and recover lexical meanings from nonsense-substituted input.

If this is right

Structural cues alone can support recovery of lexical meaning far beyond what isolated word knowledge would allow.
Language processing systems gain efficiency from tight coupling of syntax, semantics, and world knowledge.
Abilities shown on Jabberwockified English are relevant to understanding linguistic structure in both artificial and biological systems.
Robustness to degraded input becomes a measurable property that can be improved by strengthening structural integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural constraints may explain human success at understanding noisy speech or heavily accented language.
Designing future models with explicit emphasis on closed-class scaffolding could increase robustness to real-world noise without more data.
This work supplies a new benchmark for testing whether language models truly separate syntactic knowledge from memorized lexical patterns.

Load-bearing premise

The models succeed mainly by using grammar and function words to guide meaning rather than by matching patterns seen in their training data or responding to specific prompting tricks.

What would settle it

Test the same models on new Jabberwockified texts that use entirely novel nonsense strings never present in training data and measure whether translation quality drops to near zero.

Original abstract

We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
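
A minimal sketch of how a text might be Jabberwockified, under assumptions that the abstract does not state. The `FUNCTION_WORDS` list, the syllable inventories, the suffix-keeping rule, and the `jabberwockify` name are all hypothetical; the authors' actual procedure is not described here. The sketch keeps closed-class words and punctuation, swaps every other word for a pronounceable nonsense string, and reuses the same nonsense string when a word repeats.

```python
import random
import re

# Hypothetical, deliberately small closed-class list (assumption, not from the paper).
FUNCTION_WORDS = {
    "a", "an", "the", "this", "at", "of", "to", "into", "with", "in", "on",
    "by", "for", "from", "we", "he", "she", "it", "they", "his", "her",
    "their", "our", "who", "which", "that", "is", "are", "was", "were",
    "be", "been", "and", "or", "but",
}

# Inflectional endings copied onto the nonsense stem (assumed rule).
SUFFIXES = ("ing", "ed", "es", "s")

# Toy syllable inventories used to build pronounceable stems (assumption).
ONSETS = ["b", "cr", "fl", "gh", "gw", "h", "phr", "sp", "spr", "sw", "w"]
VOWELS = ["a", "ai", "ea", "i", "o", "ou", "u", "y"]
CODAS = ["b", "ch", "nt", "le", "rge", "rp", "rr", "tch", "ve"]


def nonsense_stem(rng):
    """Return one onset + vowel + coda string; it may collide with a real word."""
    return rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)


def jabberwockify(text, seed=0):
    """Replace every word not in FUNCTION_WORDS with a nonsense string."""
    rng = random.Random(seed)  # seeded so a given text maps reproducibly
    mapping = {}               # original word -> nonsense word, reused on repeats

    def replace(match):
        word = match.group(0)
        lower = word.lower()
        if lower in FUNCTION_WORDS:
            return word  # closed-class words are left untouched
        if lower not in mapping:
            # Keep the first matching ending, but only on words long enough
            # that the ending is plausibly an inflection.
            suffix = next(
                (s for s in SUFFIXES
                 if lower.endswith(s) and len(lower) > len(s) + 2),
                "",
            )
            mapping[lower] = nonsense_stem(rng) + suffix
        out = mapping[lower]
        return out.capitalize() if word[0].isupper() else out

    # Only alphabetic runs are rewritten, so punctuation and spacing survive.
    return re.sub(r"[A-Za-z]+", replace, text)
```

Calling `jabberwockify(sentence, seed=1)` on an English sentence returns a string with the same punctuation and function words; the exact nonsense strings depend on the seed. A real procedure would also need to filter out generated strings that happen to be existing words, which this sketch does not do.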

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs recover meaning from random nonsense substitutions in content words via structure, but the evidence stays mostly illustrative.

The main point is that large language models can take English sentences in which content words have been swapped for random nonsense strings and still output a version close to the original meaning. The example in the abstract works: the Jabberwocky sentence beginning "At the ghybe of the swuint" is rendered as the opening of a story about a man named Chow moving into an apartment building with his wife. The authors read this as the models leaning hard on morphosyntax and closed-class words to pin down what the nonsense must refer to.

The setup itself is the fresh part. Randomly replacing content words with nonsense strings gives a direct probe into how tightly structure constrains lexical choice, and the paper connects it cleanly to the idea that syntax, semantics, and world knowledge work together in processing. That framing is useful for thinking about both artificial and biological language systems.

The limitation is that the support shown is thin. The abstract offers only one worked example, with no counts of how often recovery succeeds, no baselines, no controls for training-data overlap or prompting choices, and no statistical checks. Without those, it is hard to tell whether the recovery is consistent or selective. If the full paper supplies the missing numbers and tests, the result becomes more solid. As it stands, the observation is intriguing but preliminary.

This is worth bringing to a reading group for people working on LLM interpretability or cognitive modeling of language. A reader who wants concrete tests of structural constraints on meaning will get something from it. It deserves peer review because the probe is straightforward and the question matters, even if the current version needs more data to carry the claim.

Referee Report

2 major / 1 minor

Summary. The paper claims that large language models can recover near-original meaning from severely degraded English texts in which content words have been randomly substituted by nonsense strings (e.g., 'At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp' translated to 'At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife'). It argues that this demonstrates structural cues such as morphosyntax and closed-class words constrain lexical meaning to a much larger degree than previously thought, with the abilities being superhuman yet relevant to understanding linguistic structure and efficient processing in biological or artificial systems.

Significance. If the central claim holds under rigorous evaluation, the result would be significant for computational linguistics by illustrating the tight integration of syntax, semantics, and world knowledge in LLMs. It could inform theories of language processing efficiency and suggest new directions for modeling how structural constraints guide interpretation, extending beyond standard benchmarks to highlight emergent capabilities in handling degraded input.

major comments (2)

[Abstract] Abstract: The central claim that LLMs translate Jabberwockified texts 'in many cases close to the original' is supported solely by illustrative examples. No quantitative metrics (e.g., accuracy rates, semantic similarity scores), sample sizes, baseline comparisons, or statistical tests are reported, leaving the generality and reliability of the 'astonishing ability' without sufficient documented support for rigorous evaluation.
[Abstract] Abstract: Potential confounds such as training-data overlap, prompting artifacts, or the specific distribution of nonsense substitutions are not addressed or controlled for. This is load-bearing for the claim that success stems primarily from leveraging morphosyntactic structure rather than alternative explanations.

minor comments (1)

[Abstract] Abstract: The method used to generate the random nonsense substitutions (e.g., criteria for selecting replacement strings or ensuring randomness) is not described, which would aid replicability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the current manuscript relies primarily on illustrative examples and lacks quantitative support, which limits rigorous evaluation of the claims. We will revise the paper substantially to include systematic experiments, metrics, baselines, and controls for confounds, while preserving the core observation about structural constraints in LLMs.

Point-by-point responses

Referee: [Abstract] Abstract: The central claim that LLMs translate Jabberwockified texts 'in many cases close to the original' is supported solely by illustrative examples. No quantitative metrics (e.g., accuracy rates, semantic similarity scores), sample sizes, baseline comparisons, or statistical tests are reported, leaving the generality and reliability of the 'astonishing ability' without sufficient documented support for rigorous evaluation.

Authors: We acknowledge that the present version presents the phenomenon through selected examples without accompanying quantitative evaluation. In the revised manuscript we will add a dedicated evaluation section reporting results on a held-out set of 200 Jabberwockified sentences drawn from diverse sources (news, fiction, technical text). We will report (i) human-rated semantic fidelity on a 1–5 scale with inter-annotator agreement, (ii) automatic metrics including BERTScore and sentence-level BLEURT against the original English, and (iii) comparison against two baselines: a bag-of-words reconstruction model and a syntax-ablated control that randomizes closed-class items. Statistical significance will be assessed via paired t-tests and bootstrap confidence intervals. Sample size and selection criteria will be fully documented. revision: yes

Referee: [Abstract] Abstract: Potential confounds such as training-data overlap, prompting artifacts, or the specific distribution of nonsense substitutions are not addressed or controlled for. This is load-bearing for the claim that success stems primarily from leveraging morphosyntactic structure rather than alternative explanations.

Authors: We agree these confounds must be ruled out. The revision will include three new control experiments: (1) substitution of content words with strings drawn from a held-out vocabulary never seen in pre-training data (verified via tokenizer inspection); (2) systematic variation of prompt phrasing (zero-shot, few-shot, chain-of-thought) with performance stability reported; (3) an ablation that preserves lexical items but disrupts morphosyntax (e.g., random word-order scrambling within clauses). We will also report the exact substitution procedure (uniform sampling over a fixed nonsense lexicon) and test robustness across different nonsense-generation distributions. These controls will be presented alongside the main results to isolate the contribution of structural cues. revision: yes
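
The paired comparison with bootstrap confidence intervals that the simulated rebuttal proposes could be sketched as follows. This is an illustration only, not the authors' analysis: the function name `paired_bootstrap_ci`, its parameters, and the example ratings are hypothetical, and the percentile bootstrap is one common choice among several.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sentence difference (a - b).

    scores_a and scores_b hold one score per sentence for two systems,
    aligned so that index i refers to the same sentence in both lists.
    Returns (observed mean difference, lower bound, upper bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded for reproducibility

    means = []
    for _ in range(n_resamples):
        # Resample sentences (not systems) with replacement.
        sample = rng.choices(diffs, k=len(diffs))
        means.append(statistics.fmean(sample))
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


if __name__ == "__main__":
    # Made-up 1-5 fidelity ratings for six sentences; not data from the paper.
    model_ratings = [4, 5, 3, 4, 5, 4]
    baseline_ratings = [2, 3, 3, 1, 2, 2]
    mean_diff, lower, upper = paired_bootstrap_ci(model_ratings, baseline_ratings)
    print(f"mean difference {mean_diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Resampling the per-sentence differences, instead of each system's scores separately, respects the pairing: both systems are scored on the same sentences, so sentence difficulty cancels out of each difference.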

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical demonstration reporting LLM performance on constructed jabberwockified inputs. It contains no mathematical derivations, equations, fitted parameters, or self-referential definitions. The central claim follows directly from the observed translations without any reduction to inputs by construction, self-citation chains, or smuggled ansatzes. The provided abstract and framing rely on direct evidence rather than any load-bearing logical step that collapses into its own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical demonstration using existing LLMs and relies on standard assumptions about language models and linguistic structure without introducing new free parameters or invented entities.

axioms (1)

domain assumption: Large language models trained on broad text corpora can leverage contextual, syntactic, and world-knowledge cues to infer missing lexical items.
This underpins the expectation that LLMs will succeed on the degraded inputs.

pith-pipeline@v0.9.0 · 5486 in / 1295 out tokens · 40986 ms · 2026-05-15T18:53:03.501462+00:00 · methodology
