arxiv: 2504.01542 · v2 · pith:RBKZ3U6Znew · submitted 2025-04-02 · 💻 cs.CL

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Pith reviewed 2026-05-25 08:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM pretraining · register classification · language variation · data curation · model performance · corpus linguistics · text genres

The pith

Register of pretraining data shapes LLM performance, with opinion texts helping and news texts hurting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains small generative models on data sliced by register, a standard linguistic category for text genre, and measures the effect on standard benchmarks. It finds that single-register training sets produce weaker models than the full unfiltered corpus, yet certain registers outperform others and their combinations yield further gains. Opinion texts improve results while news texts drag them down. The work treats register as an explainer of why some data mixtures succeed or fail. It concludes that deliberate register-aware selection can improve future pretraining curation.

Core claim

Register classification applied to pretraining data reveals that the News register produces subpar model performance while the Opinion register, covering reviews and opinion blogs, is highly beneficial; combinations of How-to-Instructions, Informational Description, and Opinion yield major improvements, although the full unfiltered dataset still outperforms every single-register set.

What carries the argument

Register classification of pretraining texts, used to create sliced training sets for small generative models, which are then evaluated on benchmarks.
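
The slicing step can be pictured roughly as follows. This is a minimal sketch, not the authors' pipeline: the `Doc` type, the label strings, and the `slice_by_register` and `combine` helpers are hypothetical, and it assumes each document already carries a single register label from some upstream classifier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    register: str  # hypothetical label assigned by an upstream register classifier


def slice_by_register(docs):
    """Group documents into one training subset per register label."""
    slices = defaultdict(list)
    for doc in docs:
        slices[doc.register].append(doc)
    return dict(slices)


def combine(slices, registers):
    """Concatenate the subsets of the chosen registers into one training set."""
    return [doc for name in registers for doc in slices.get(name, [])]


# Toy corpus; texts and labels are invented for illustration only.
corpus = [
    Doc("Council votes on new budget.", "News"),
    Doc("This blender is the best I have owned.", "Opinion"),
    Doc("Step 1: preheat the oven.", "How-to-Instructions"),
    Doc("The river is 120 km long.", "Informational Description"),
    Doc("Markets closed higher today.", "News"),
]

slices = slice_by_register(corpus)
mixture = combine(
    slices, ["How-to-Instructions", "Informational Description", "Opinion"]
)
print({name: len(subset) for name, subset in slices.items()})
print(len(mixture))
```

Each entry of `slices` would correspond to one single-register training condition, and `mixture` to a combined-register condition; the unfiltered baseline is simply `corpus` itself.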

If this is right

  • Models trained only on News register data underperform relative to those trained on other registers or the full set.
  • Adding Opinion-class data measurably raises performance across benchmarks.
  • Merging How-to-Instructions, Informational Description, and Opinion registers produces larger gains than any one of them alone.
  • Individual benchmark results differ by register, exposing distinct strengths and weaknesses in the resulting models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Binary quality filters could be replaced or augmented by register-aware filters that retain beneficial classes and down-weight others.
  • Register balance might serve as a new axis for dataset documentation alongside size and domain labels.
  • Future work could test whether register effects persist after continued pretraining or instruction tuning on the same base models.
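
The register-aware filter suggested in the first point above might look something like the sketch below. It is an editorial illustration only: the weight values, the `REGISTER_WEIGHTS` table, and the `sample_by_register` helper are hypothetical and are not taken from the paper, which reports no such weights.

```python
import random
from collections import Counter

# Hypothetical per-register weights: values above 1 up-weight a register,
# values below 1 down-weight it. The numbers are illustrative only.
REGISTER_WEIGHTS = {
    "Opinion": 2.0,
    "How-to-Instructions": 1.5,
    "Informational Description": 1.5,
    "News": 0.5,
}
DEFAULT_WEIGHT = 1.0  # registers without an explicit weight keep their natural share


def sample_by_register(docs, k, seed=0):
    """Draw k documents with replacement, weighting each one by its register.

    docs is a list of (text, register) pairs.
    """
    rng = random.Random(seed)
    weights = [REGISTER_WEIGHTS.get(register, DEFAULT_WEIGHT) for _, register in docs]
    return rng.choices(docs, weights=weights, k=k)


# Toy pool with equal numbers of News and Opinion documents.
docs = [(f"news {i}", "News") for i in range(50)]
docs += [(f"opinion {i}", "Opinion") for i in range(50)]

sample = sample_by_register(docs, k=10_000)
counts = Counter(register for _, register in sample)
print(counts)
```

Unlike a binary filter, this keeps every register in the pool and only shifts its share of the sample, so a down-weighted class such as News is reduced rather than discarded.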

Load-bearing premise

Effects observed when training small models on register-sliced data will transfer to the large-scale models and mixed corpora used in actual LLM production.

What would settle it

Train a production-scale model on a register-balanced or register-filtered corpus and compare its benchmark scores directly against an otherwise identical model trained on the unfiltered mixture; a null difference would falsify the claim that register substantially explains performance variation.

Original abstract

Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labelling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters are deemed as valuable examples, others are discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilising registers or genres - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We train small generative models with register classified data and evaluate them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to present the first study using linguistic registers (genres) to curate and analyze pretraining data for LLMs. By training small generative models on register-classified subsets of data (e.g., News, Opinion, How-to-Instructions, Informational Description) and evaluating on standard benchmarks, it reports that register substantially affects performance: News yields subpar results, Opinion is highly beneficial, and mixtures of strong registers bring major improvements, while the full unfiltered dataset outperforms every single-register set. It concludes that register is an important explainer of model variation and can guide more deliberate data selection.

Significance. If the core empirical patterns hold under scaling, the work supplies a linguistically grounded dimension for pretraining data curation that goes beyond binary quality filters. It identifies differential impacts of specific registers on benchmark performance and reports that targeted combinations of registers bring major improvements, offering a concrete, falsifiable basis for future mixture experiments.

major comments (2)
  1. [Abstract / Methods] Abstract and Methods: All reported results come from training small generative models on register-sliced subsets; the manuscript contains no scaling runs, mixture experiments at larger parameter counts, or explicit argument showing why the observed deltas (e.g., News subpar, Opinion beneficial) would persist under the data volumes and multi-register mixtures used in production LLMs. This directly limits the load-bearing claim that register 'substantially affects model performance' for LLMs.
  2. [Results] Results / Evaluation: The claims of 'major improvements' from combining How-to-Instructions, Informational Description, and Opinion, and of 'key differences in strengths and drawbacks,' rest on benchmark scores whose statistical reliability cannot be assessed because the manuscript provides no sample sizes, number of runs, error bars, or significance tests (as noted in the abstract-only review).
minor comments (2)
  1. [Abstract] The abstract refers to 'standard benchmarks' without naming them or indicating which register classes drive gains on which tasks; adding a table or section reference would improve clarity.
  2. [Introduction] Notation for register classes (e.g., 'Opinion class, covering texts such as reviews and opinion blogs') should be defined once with a reference to the classification scheme used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key limitations regarding experimental scale and statistical reporting. We respond to each major comment below, indicating planned revisions where appropriate.

  1. Referee: [Abstract / Methods] Abstract and Methods: All reported results come from training small generative models on register-sliced subsets; the manuscript contains no scaling runs, mixture experiments at larger parameter counts, or explicit argument showing why the observed deltas (e.g., News subpar, Opinion beneficial) would persist under the data volumes and multi-register mixtures used in production LLMs. This directly limits the load-bearing claim that register 'substantially affects model performance' for LLMs.

    Authors: We agree that the experiments use only small models and lack scaling runs or larger-mixture tests, which constrains direct extrapolation to production-scale LLMs. The study’s intent is to isolate register effects via controlled small-model training; we will revise the abstract, introduction, and conclusion to explicitly qualify all claims as applying to small models, add a limitations subsection discussing the absence of scaling evidence, and include a brief linguistic argument (drawing on register theory) for why the observed patterns may generalize. We cannot perform new large-scale runs, so the revision will temper rather than expand the load-bearing claims.

    Revision: partial

  2. Referee: [Results] Results / Evaluation: The claims of 'major improvements' from combining How-to-Instructions, Informational Description, and Opinion, and of 'key differences in strengths and drawbacks,' rest on benchmark scores whose statistical reliability cannot be assessed because the manuscript provides no sample sizes, number of runs, error bars, or significance tests (as noted in the abstract-only review).

    Authors: We accept that single-run point estimates without error bars or significance tests weaken the reliability assessment. Each training condition was run once owing to compute limits. In revision we will (1) state the number of runs explicitly, (2) replace or qualify phrases such as “major improvements” with language indicating observed point-estimate differences, and (3) add a limitations paragraph noting the lack of variance estimates. These textual changes will be made; no new training runs are feasible.

    Revision: yes
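
To make the referee's request concrete, the sketch below shows one way run-to-run variance and a significance test could be reported if several seeds per condition were available. Everything in it is an assumption for illustration: the scores are invented, not results from the paper, the helper names are hypothetical, and the pairing assumes the same seeds were used for both conditions.

```python
import itertools
import statistics

# Hypothetical benchmark accuracies from five seeds per condition.
# These numbers are invented and are not results from the paper.
scores_a = [0.412, 0.405, 0.419, 0.408, 0.415]  # e.g. a register mixture
scores_b = [0.398, 0.401, 0.395, 0.404, 0.399]  # e.g. a single register


def summarize(scores):
    """Return the mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p_value(a, b, tol=1e-12):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each per-seed difference is
    arbitrary, so every sign assignment is enumerated and the share whose
    mean difference is at least as extreme as the observed one is returned.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = statistics.mean([s * d for s, d in zip(signs, diffs)])
        if abs(flipped) >= observed - tol:
            extreme += 1
        total += 1
    return extreme / total


mean_a, sd_a = summarize(scores_a)
mean_b, sd_b = summarize(scores_b)
p_value = paired_sign_flip_p_value(scores_a, scores_b)
print(f"A: {mean_a:.3f} +/- {sd_a:.3f}")
print(f"B: {mean_b:.3f} +/- {sd_b:.3f}")
print(f"p = {p_value:.4f}")
```

With only five paired runs the smallest two-sided p-value this test can produce is 2/32, which illustrates why a handful of seeds supports error bars more readily than strong significance claims.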

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of trained models on register slices

The manuscript reports training small generative models on register-classified subsets of pretraining data and evaluating them directly on standard benchmarks. All reported relationships (e.g., News register yielding subpar results, Opinion class being beneficial, combinations of How-to-Instructions/Informational Description/Opinion improving performance) are obtained from these experiments. No equations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or ansatzes are invoked to derive the central claims; the results stand as direct empirical observations without reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that established corpus-linguistic register categories are meaningful predictors of LLM learning outcomes; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Register categories from corpus linguistics accurately partition web text in ways that matter for model training dynamics.
    The entire experimental design groups data by these categories and attributes performance differences to them.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    The authors introduce a register-aware evaluation framework that compares LLM outputs to human reference corpora via Biber's lexico-grammatical features and MMD across five English registers.

  2. Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.