
arxiv: 2510.13928 · v2 · submitted 2025-10-15 · 💻 cs.CL · cs.AI

LLMs Can Get "Brain Rot": A Pilot Study on Twitter/X

Pith reviewed 2026-05-18 07:10 UTC · model grok-4.3

keywords large language models · continual pre-training · data quality · cognitive decline · reasoning benchmarks · Twitter data · safety evaluations

The pith

Continual pre-training on junk Twitter text causes lasting cognitive decline in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models experience cognitive decline from continual training on low-quality social media text. The authors build junk datasets from real Twitter posts using high engagement levels or low semantic quality and compare them to matched control sets with identical token counts and training steps. Training four models on increasing shares of the junk data produces clear drops in reasoning, long-context understanding, and safety while raising dark traits such as narcissism, and the drops grow larger with more junk included. Error analysis shows models increasingly skip steps in reasoning chains. The work indicates that the quality of data used for ongoing model updates can produce persistent negative effects on capabilities.

Core claim

The authors establish that continual pre-training on Twitter data labeled as junk through either high engagement or low semantic quality produces non-trivial declines on reasoning benchmarks, long-context understanding, and safety evaluations while also increasing dark traits, with the magnitude of decline rising in proportion to the fraction of junk data used.

What carries the argument

The construction and use of junk and reverse-controlled Twitter datasets based on engagement degree and semantic quality, applied via controlled continual pre-training on four LLMs.
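A minimal sketch of how a junk/control mixture with a fixed token budget might be assembled at a given junk ratio. The names `build_mixture` and `take_until`, the whitespace token count, and the toy posts are assumptions made for illustration; this is not the authors' pipeline.

```python
import random

def count_tokens(text):
    # Assumption: whitespace tokens stand in for the real tokenizer.
    return len(text.split())

def take_until(posts, budget):
    """Greedily take posts that still fit within the token budget."""
    picked, used = [], 0
    for post in posts:
        n = count_tokens(post)
        if used + n > budget:
            continue  # skip posts that would overshoot the budget
        picked.append(post)
        used += n
    return picked, used

def build_mixture(junk, control, junk_ratio, total_tokens, seed=0):
    """Mix junk and control posts so about junk_ratio of the tokens are junk."""
    if not 0.0 <= junk_ratio <= 1.0:
        raise ValueError("junk_ratio must be in [0, 1]")
    rng = random.Random(seed)
    junk, control = junk[:], control[:]  # do not mutate the caller's lists
    rng.shuffle(junk)
    rng.shuffle(control)
    junk_budget = round(total_tokens * junk_ratio)
    junk_part, junk_used = take_until(junk, junk_budget)
    control_part, control_used = take_until(control, total_tokens - junk_budget)
    mixture = junk_part + control_part
    rng.shuffle(mixture)
    return mixture, junk_used, control_used

# Hypothetical toy posts, two tokens each.
toy_junk = ["omg wow"] * 50
toy_control = ["measured result"] * 50
mixture, junk_used, control_used = build_mixture(toy_junk, toy_control, 0.2, 50)
print(len(mixture), junk_used, control_used)
```

Holding `total_tokens` fixed while varying `junk_ratio` is what keeps the conditions comparable in token scale, which is the matching property the paper's design depends on.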

If this is right

  • Reasoning performance on ARC-Challenge with chain-of-thought drops from 72.1 to 57.2 as the junk ratio rises from 0% to 100% under the engagement measure.
  • Models increasingly truncate or skip steps in reasoning chains as the primary form of error.
  • Additional instruction tuning and clean continual pre-training produce partial recovery but leave residual deficits below the original baseline.
  • Tweet popularity predicts the size of the decline better than tweet length does.
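For scale, the reported M1 endpoints can be turned into absolute and relative drops. The sketch below uses only the figures quoted in the abstract; the helper name `drop` is an assumption for illustration.

```python
def drop(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)

# Endpoints reported in the abstract for M1 (junk ratio 0% -> 100%).
reported = {
    "ARC-Challenge (CoT)": (72.1, 57.2),
    "RULER-CWE": (83.7, 52.3),
}

for name, (before, after) in reported.items():
    absolute, relative = drop(before, after)
    print(f"{name}: -{absolute} points ({relative}% relative)")
```

On these numbers the ARC-Challenge drop is 14.9 points (about 20.7% relative) and the RULER-CWE drop is 31.4 points (about 37.5% relative).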

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data pipelines for ongoing model updates may need explicit filters based on semantic quality to limit gradual capability loss.
  • Similar degradation patterns could appear with low-quality text from other social platforms that reward engagement over depth.
  • Periodic checks on fixed reasoning and safety benchmarks could become routine to track model health during continual pre-training.
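One way such a periodic check might look, as a rough sketch: compare current scores on a fixed benchmark set against a stored baseline and flag anything that falls by more than a tolerance. The function name `health_check` and the 2-point tolerance are assumptions, not something the paper specifies.

```python
def health_check(baseline, current, max_drop=2.0):
    """Return benchmarks whose score fell by more than max_drop points."""
    regressions = {}
    for name, base_score in baseline.items():
        if name not in current:
            raise KeyError(f"missing benchmark: {name}")
        delta = current[name] - base_score
        if delta < -max_drop:
            regressions[name] = round(delta, 1)
    return regressions

# Scores taken from the abstract's M1 endpoints (0% vs 100% junk).
baseline = {"ARC-Challenge (CoT)": 72.1, "RULER-CWE": 83.7}
after_junk = {"ARC-Challenge (CoT)": 57.2, "RULER-CWE": 52.3}
print(health_check(baseline, after_junk))
```

In practice the tolerance would need to reflect run-to-run variance, which the referee report notes is not reported.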

Load-bearing premise

The observed performance drops are caused by the engagement or semantic properties of the junk text rather than by unmatched statistical properties of the Twitter corpora or by the specific training schedule.

What would settle it

An experiment that matches the junk and control datasets on every measurable statistical property, including token frequencies and sequence statistics, applies identical training, and then finds no performance difference would indicate that the effect is not due to junk content.
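A first step toward that kind of matching is simply measuring how far apart the two corpora are. The sketch below compares lexical diversity (type-token ratio) and unigram distributions (Jensen-Shannon divergence) for two token lists. The function names and toy corpora are assumptions for illustration; a real audit would also cover perplexity under a reference LM, topics, and syntax.

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for an empty corpus)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unigram_distribution(tokens):
    """Map each token to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    vocab = set(p) | set(q)
    mid = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # Terms with zero probability under `a` contribute nothing.
        return sum(a[t] * math.log2(a[t] / mid[t]) for t in vocab if a.get(t, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical toy corpora, already tokenized.
junk_tokens = "wow wow lol check this out wow".split()
control_tokens = "the study reports a measured decline in accuracy".split()

print(type_token_ratio(junk_tokens), type_token_ratio(control_tokens))
print(js_divergence(unigram_distribution(junk_tokens),
                    unigram_distribution(control_tokens)))
```

A divergence near zero on such statistics would support attributing the drops to the junk labels; a large divergence would leave the confound open.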

Original abstract

We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To unveil junk effects, we designed a novel controlled experiment on real Twitter/X corpora, by constructing junk and reverse-controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Compared to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' g>0.3) on reasoning, long-context understanding, safety, and inflating "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain-of-Thought drops 72.1 -> 57.2 and RULER-CWE 83.7 -> 52.3 as junk ratio rises from 0% to 100%. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion in reasoning: models increasingly truncate or skip chains. Second, partial but incomplete healing is observed: scaling instruction tuning and clean continual pre-training improve the declined cognition, yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that social effects of data could be a causal driver of LLM capability decay in continual pre-training, thereby motivating routine "cognitive health checks" for deployed and evolving LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a pilot study testing the LLM Brain Rot Hypothesis: continual pre-training on junk Twitter/X text causes lasting cognitive decline in LLMs. Using two orthogonal definitions of junk (M1: engagement degree; M2: semantic quality) with matched token counts and training operations, the authors continually pre-train 4 LLMs and report non-trivial declines (Hedges' g > 0.3) on reasoning (ARC-Challenge CoT drops from 72.1 to 57.2), long-context understanding (RULER-CWE from 83.7 to 52.3), safety benchmarks, and inflated dark traits as junk ratio rises from 0% to 100%. They further identify thought-skipping as the dominant reasoning error, observe partial recovery via instruction tuning and clean pre-training, and note that tweet popularity outperforms length as a predictor under M1.

Significance. If the results hold, the work supplies multi-perspective empirical evidence that engagement and semantic properties of web data can causally degrade LLM capabilities during continual pre-training, beyond simple token-volume effects. Strengths include the orthogonal junk operationalizations, explicit dose-response curves, error forensics, and recovery experiments; these elements make the design more informative than single-condition comparisons. The pilot scope with four models limits generalizability but usefully motivates routine cognitive-health monitoring for evolving LLMs.

major comments (2)
  1. [Methods / Dataset Construction] Dataset construction and matching procedure: the central claim requires that observed declines on ARC-Challenge, RULER, and safety suites are driven by the semantic/engagement properties of the junk text rather than unmatched higher-order corpus statistics. Token counts and training operations are matched, yet no balancing or reporting is provided for perplexity under a reference LM, n-gram distributions, lexical diversity, topic distribution, or syntactic complexity between junk and control sets. These unmeasured differences could produce the reported performance drops independently of the junk labels.
  2. [Results / Error Forensics] Results section on error forensics: the identification of thought-skipping as the primary lesion is presented as a key insight, but the manuscript does not specify the a priori criteria or annotation protocol used to detect and quantify truncation or skipping of reasoning chains across conditions. Without this, it is unclear whether the pattern was hypothesized before inspection or emerged post-hoc, affecting the strength of the mechanistic interpretation.
minor comments (3)
  1. [Abstract] Abstract and results: the reported metric drops and Hedges' g values would be more interpretable with accompanying standard errors or confidence intervals; their absence on all metrics makes it harder to judge the reliability of the dose-response trends.
  2. [Evaluation Metrics] Throughout: the exact instruments and scoring procedures for safety benchmarks and dark-trait measures (psychopathy, narcissism) should be stated explicitly, including any prompt templates or evaluation rubrics.
  3. [Discussion] Discussion: the suggestion of persistent representational drift versus format mismatch could be strengthened by reporting representation-similarity or probing analyses before and after the continual pre-training stages.
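To make the first minor comment concrete, here is a minimal sketch of Hedges' g with a percentile bootstrap confidence interval. It uses the standard small-sample correction 1 - 3/(4(n1 + n2) - 9). The per-run scores are synthetic placeholders, not values from the paper, and the function names are assumptions.

```python
import random
import statistics

def hedges_g(a, b):
    """Bias-corrected standardized mean difference between two samples."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a) +
                  (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    d = (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample correction
    return d * correction

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Hedges' g."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        try:
            stats.append(hedges_g(ra, rb))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Synthetic placeholder scores (NOT from the paper).
junk_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
control_runs = [2.0, 3.0, 4.0, 5.0, 6.0]
print(hedges_g(junk_runs, control_runs))
print(bootstrap_ci(junk_runs, control_runs))
```

With only a handful of runs per condition the interval is typically wide, which is why a point estimate such as g > 0.3 is hard to interpret on its own.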

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful review. We appreciate the acknowledgment of the study's design strengths, including the orthogonal junk operationalizations and dose-response analyses. We address each major comment below with proposed revisions to enhance clarity and robustness.

Point-by-point responses
  1. Referee: [Methods / Dataset Construction] Dataset construction and matching procedure: the central claim requires that observed declines on ARC-Challenge, RULER, and safety suites are driven by the semantic/engagement properties of the junk text rather than unmatched higher-order corpus statistics. Token counts and training operations are matched, yet no balancing or reporting is provided for perplexity under a reference LM, n-gram distributions, lexical diversity, topic distribution, or syntactic complexity between junk and control sets. These unmeasured differences could produce the reported performance drops independently of the junk labels.

    Authors: We agree that explicit controls for higher-order corpus statistics would strengthen isolation of the junk properties' effects. While the consistent degradation patterns across two orthogonal junk definitions (M1 engagement and M2 semantic quality) provide evidence against purely superficial confounds, we did not report perplexity, n-gram distributions, lexical diversity, topic distributions, or syntactic complexity in the original manuscript. In the revised version, we will add these analyses (e.g., perplexity under a held-out reference LM, type-token ratios, and topic model comparisons) for the junk and control sets in the Methods section or Appendix to directly address this concern. revision: partial

  2. Referee: [Results / Error Forensics] Results section on error forensics: the identification of thought-skipping as the primary lesion is presented as a key insight, but the manuscript does not specify the a priori criteria or annotation protocol used to detect and quantify truncation or skipping of reasoning chains across conditions. Without this, it is unclear whether the pattern was hypothesized before inspection or emerged post-hoc, affecting the strength of the mechanistic interpretation.

    Authors: The thought-skipping pattern was identified through systematic manual review of model outputs on reasoning tasks, comparing error types across junk ratios. We defined it as abrupt truncation of reasoning chains without logical completion (distinct from factual errors or format violations). While the analysis had an exploratory component, we will revise the manuscript to explicitly document the annotation protocol, including the predefined error categories, sample sizes inspected per condition, and inter-annotator agreement if applicable. This will be added to the error forensics subsection to improve transparency and reproducibility. revision: yes
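If two annotators label the same outputs, the inter-annotator agreement mentioned in the response could be reported as Cohen's kappa. The sketch below is one possible way to compute it; the label names and the example annotations are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at random
    # with their own observed label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of four model outputs.
annotator_1 = ["thought_skipping", "thought_skipping", "factual_error", "factual_error"]
annotator_2 = ["thought_skipping", "factual_error", "factual_error", "factual_error"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5.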

Circularity Check

0 steps flagged

No significant circularity: empirical measurements on external benchmarks

Full rationale

The paper reports results from a controlled empirical study: junk and reverse-controlled Twitter datasets are constructed via two operationalizations (M1 engagement, M2 semantic quality), token scale and training operations are matched, four LLMs undergo continual pre-training, and performance is measured on independent external suites (ARC-Challenge, RULER, safety benchmarks). Declines, dose-response curves, thought-skipping observations, and partial healing under instruction tuning are direct experimental outcomes, not reductions of any internal equation or fitted parameter to itself. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the reported chain; the central claim rests on observable differences against fixed external metrics rather than on any ansatz or uniqueness theorem imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the assumption that the two operationalizations of junk (engagement degree and semantic quality) isolate the causal factor responsible for capability decay, and that the matched training operations eliminate other sources of performance difference.

axioms (1)
  • domain assumption · High-engagement or low-semantic-quality tweets constitute junk text that induces cognitive decline when used for continual pre-training
    This premise is invoked when constructing the junk dataset and attributing observed declines to it rather than to other corpus statistics.
invented entities (1)
  • LLM Brain Rot · no independent evidence
    purpose: Label for the hypothesized lasting cognitive decline induced by junk text
    New descriptive term introduced to organize the observed declines in reasoning, safety, and personality traits; no independent falsifiable prediction outside the current experiments is provided.

pith-pipeline@v0.9.0 · 5874 in / 1419 out tokens · 33297 ms · 2026-05-18T07:10:54.416097+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Impact of AI-Generated Text on the Internet

    cs.CY 2026-04 unverdicted novelty 7.0

    By mid-2025 roughly 35% of new websites are AI-generated or AI-assisted, correlating with lower semantic diversity and higher positive sentiment but showing no significant drop in factual accuracy or stylistic diversity.

  2. LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

    cs.CV 2026-03 unverdicted novelty 7.0

    KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

  3. State Contamination in Memory-Augmented LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.

  4. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...