Language models struggle with compartmentalization

David Wingate; Thomas Vincent Howe

arxiv: 2605.19284 · v1 · pith:Q4QYHHOUnew · submitted 2026-05-19 · 💻 cs.CL · cs.LG

Language models struggle with compartmentalization

Thomas Vincent Howe , David Wingate This is my paper

Pith reviewed 2026-05-20 06:08 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords compartmentalizationlarge language modelsmultilingual learningrepresentational unificationstatistical strength sharingphase transitionssample efficiency

0 comments

The pith

Large language models fail to unify distinct presentations of the same concept, learning redundant parallel representations instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models exhibit compartmentalization by not identifying and sharing statistical strength between different presentations of unified concepts in their training data. This includes the same facts in different languages or the same functions in different programming languages. In the worst case, the models learn separate internal representations for each presentation, which wastes model capacity on redundancies and makes learning less efficient as the number of presentations grows. The authors also show that this effect is prominent in early multilingual training for small models and that attempts to fix it through interventions only work inconsistently depending on the number of presentations.

Core claim

We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations.

What carries the argument

Compartmentalization: the tendency for LLMs to learn parallel internal representations of distinct presentations of the same latent concept instead of unifying them.

If this is right

Sample efficiency decreases with the number of distinct presentations of a concept.
Synthetic parallel data fails to improve unification despite being learnable.
Early multilingual learning is nearly entirely compartmentalized in small models.
Interventions to encourage unification show phase transitions based on the number of presentations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Compartmentalization may limit the benefits of multilingual or multi-format training data.
The language modeling objective alone may be insufficient to force unification of representations.
This issue could affect performance in tasks requiring cross-lingual or cross-format transfer.

Load-bearing premise

Distinct presentations of unified concepts, such as facts in English and Swahili, correspond to a single latent concept that the model should identify and share statistical strength across rather than treating them independently.

What would settle it

Observing that models achieve strong cross-presentation transfer or use overlapping representations for the same concept in different languages without evidence of redundant parallel structures.

Figures

Figures reproduced from arXiv: 2605.19284 by David Wingate, Thomas Vincent Howe.

**Figure 1.** Figure 1: Compartmentalization costs sample efficiency and capacity. a How slowdown is calculated: at any val-loss target L achieved by some checkpoint of the base c=1 model (here L=4.5), find the iterations where c=1 and c=N each cross L and take their ratio. We linearly interpolate to find the intersection point for the compartmentalized model. The example reads 5.0× for c=8 at L=4.5 on the 41.9M model. b Slowdown… view at source ↗

**Figure 2.** Figure 2: The amount of structure in each compartment determines the capacity cost of compartmentalization. All runs use a 14.7M base model with c=2; we compare four choices for the second compartment’s data—English (a homogeneous control), Russian, unigram-frequencysampled noise, and uniform-random-token noise—against the single-compartment (c=1) baseline. a English-side validation loss for each condition. b Slowd… view at source ↗

**Figure 3.** Figure 3: For the 14.7M param configuration, parallel “translation” data has a slightly negative effect until a phase transition at c=8. The translation task itself is solved at every c, but does not serve to reduce compartmentalization. a Final per-compartment validation loss vs. translation ratio, one line per compartment count c. Dotted line marks the c=1 baseline floor (3.932 nats). b Target-half validation loss… view at source ↗

**Figure 4.** Figure 4: The c=8 phase transition replicates at 1B parameters between tr=0.25 and tr=0.5, and the local translation task is mastered increasingly faster as the ratio grows. a Final percompartment validation loss as a function of translation ratio, matched at step 106 . c=2 increases from +0.15 to +0.24 as the ratio increases. c=8 stays at the c=8 plateau (+0.35–0.44 nats) for tr ≤ 0.25 and breaks to +0.19–+0.24 na… view at source ↗

**Figure 5.** Figure 5: At translation ratio 0.75, weight decay shifts the validation plateau phase transition from c=8 down to c=6. 14.7M base model, tr=0.75 (absolute). a Validation loss as a function of weight decay. c=5 is unaffected by weight decay—the line stays at the c=5 plateau (∼4.34) for every wd ∈ {0, 0.01, 0.05, 0.1, 0.2}. c=6 inflects sharply between wd=0.05 and wd=0.2, dropping to ∼4.07—within 0.03 nats of the best… view at source ↗

**Figure 6.** Figure 6: InfoNCE’s benefit scales with c: at c=6 and c=8 it recovers ∼ 83%–∼ 84% of the compartmentalization tax, at c=4 it closes ∼ 26%, and at c=2 it provides no measurable improvement. a Residual val loss of each InfoNCE-trained c-compartment model relative to the fully-trained c=1 baseline (3.932 nats). At c∈{6, 8}, InfoNCE breaks each respective plateau baseline and approaches c=1 (residual +0.063 and +0.067 n… view at source ↗

**Figure 7.** Figure 7: At <1B scale, the gap between multilingual and compartmentalized-multilingual validation loss is small, but there is a consistent sample-efficiency gap between English-only and multilingual. a EN validation loss at step 5000 (the final step of the en-only run) for shared, compartmentalized, and en-only models, across four scales (87M to 620M parameters). b The validation loss gap between training with a si… view at source ↗

**Figure 8.** Figure 8: Biography vs. Q&A capacity (N=15k people, seed-averaged): Same data in different formats compete for representational capacity. Each panel: extraction accuracy vs. training step (log) for four training conditions, mean across 3 seeds. We show only same-format results here; crossformat results ( [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Full weight-decay sweep at the 14.7M scale, for c ∈ {5, 6, 8} across tr ∈ {0, 0.056, 0.167, 0.25, 0.5, 0.75} and wd ∈ {0, 0.01, 0.05, 0.1, 0.2}. Body [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: InfoNCE hyperparameter tuning at c=2, 14.7M architecture, tr=0. Both panels plot residual val loss relative to the fully-trained c=1 baseline (3.932 nats); the dotted reference is the c=2 no-InfoNCE final (+0.154 nats). a Coefficient sweep λ ∈ {0.1, 0.7, 1.0, 1.3, 10} at the canonical n=32 negatives. Step-matched ranking: λ=1.3 (+0.147, 4.079) < λ=0.7 (+0.154, 4.086) < λ=1.0 (+0.160, 4.092) ≈ λ=10 (+0.160… view at source ↗

**Figure 11.** Figure 11: Full extraction-accuracy grid for the bio capacity experiment, N=15,000 profiles, mean across 3 seeds. Diagonal panels (a, d) are same-format recall; off-diagonal panels (b, c) are cross-format. Compartmented training shows clean format isolation by construction (off-diagonals near 0%). The body shows the diagonal panels only ( [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: Validation curves for unified solution experiments. a Slowdown of compartmented runs vs. the c=1 rope baseline at matched val loss, for c=2 (blue) and c=8 (red) at the 14.7M scale. Runs initialised by copying compartment 0’s embeddings to every other compartment at t=0 hug the c=1 floor throughout training; runs with the standard random init exhibit a capacity-driven sample efficiency plateau. b Minimal c… view at source ↗

**Figure 13.** Figure 13: Initialization-copy results across dmodel ∈ {32, 64, 128, 256}. At every scale, c=2 with init-copy hugs the c=1 baseline; default-init c=2 carries a fixed compartmentation gap. Body [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

**Figure 14.** Figure 14 [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗

**Figure 15.** Figure 15: Multilingual case study supplementary panels. Companion to body [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

read the original abstract

In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows LLMs can fail to share strength across equivalent data in different languages or formats, but the case for truly separate internal copies rests on indirect behavioral evidence.

read the letter

The main thing to know is that this work documents how models treat the same facts or functions as separate when they appear in English versus Swahili or in different programming languages. Sample efficiency drops as the number of presentations grows, and synthetic parallel data does not close the gap. Early multilingual training in small models looks almost fully compartmentalized, and interventions show phase transitions tied to the count of distinct presentations.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that LLMs exhibit compartmentalization: they fail to identify and share statistical strength across distinct surface presentations of the same latent concept (e.g., facts in English vs. Swahili, or functions in Python vs. Haskell). In the worst case this produces parallel internal representations that saturate capacity and degrade sample efficiency with the number of presentations. The authors support the claim with experiments on multilingual learning (early learning is nearly fully compartmentalized for small models), synthetic parallel data (which is easily learned yet does not reduce compartmentalization), and a family of interventions whose effectiveness shows a phase transition with the number of distinct presentations.

Significance. If the central empirical pattern holds, the result bears on long-standing questions about representation unification in LLMs and on practical questions of sample efficiency in multilingual and multi-format training. The reported phase transitions in intervention effectiveness are a useful diagnostic and could guide future work on objectives that encourage cross-presentation sharing. The paper does not yet supply direct representation-level measurements, so the strength of the mechanistic interpretation remains to be established.

major comments (3)

[§4 (empirical results)] The central claim that models 'simply learn parallel internal representations' (abstract and §4) rests on downstream accuracy, sample-efficiency curves, and intervention phase transitions. No embedding similarity, linear-probe, or activation-intervention results are reported that would distinguish truly separate copies from a shared representation that simply fails to transfer under the tested conditions. This distinction is load-bearing for the mechanistic interpretation.
[§4.3 and discussion] The statement that compartmentalization 'saturates model capacity with redundancies' (abstract) is not accompanied by any capacity accounting, scaling-law analysis, or parameter-counting argument. Without such evidence the capacity-saturation claim remains an inference from performance rather than a measured quantity.
[§3 (multilingual experiments)] The multilingual experiments (§3) report that early learning is 'nearly entirely compartmentalized' for small models, yet the manuscript does not state the model sizes, training data volumes, or statistical tests used to support the 'nearly entirely' quantification. These details are required to assess whether the observed separation is robust or an artifact of the particular training regime.

minor comments (3)

[Abstract] The abstract refers to 'all interventions that we study' without enumerating them; a brief list or forward reference to the relevant subsection would improve readability.
[§2 (framework)] Notation for the number of presentations (e.g., k) is introduced informally; a single definitions paragraph or table would reduce ambiguity when comparing across experiments.
[§4.2] Dataset construction details for the synthetic parallel data (how alignment was enforced, vocabulary overlap, etc.) are only sketched; an appendix table or short paragraph would allow replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree and the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§4 (empirical results)] The central claim that models 'simply learn parallel internal representations' (abstract and §4) rests on downstream accuracy, sample-efficiency curves, and intervention phase transitions. No embedding similarity, linear-probe, or activation-intervention results are reported that would distinguish truly separate copies from a shared representation that simply fails to transfer under the tested conditions. This distinction is load-bearing for the mechanistic interpretation.

Authors: We agree that direct representation-level measurements would provide stronger support for the mechanistic interpretation. Our evidence is currently behavioral: the failure of easily learned synthetic parallel data to reduce compartmentalization, together with the sharp phase transitions in intervention effectiveness. These patterns are difficult to explain if representations were already unified but simply failed to transfer. We will add linear-probe experiments measuring cross-presentation similarity on the smaller models in the revised manuscript. revision: yes
Referee: [§4.3 and discussion] The statement that compartmentalization 'saturates model capacity with redundancies' (abstract) is not accompanied by any capacity accounting, scaling-law analysis, or parameter-counting argument. Without such evidence the capacity-saturation claim remains an inference from performance rather than a measured quantity.

Authors: The referee correctly notes that we provide no direct capacity accounting or scaling-law analysis. The saturation claim is an interpretive inference drawn from the measured drop in sample efficiency with additional presentations and the intervention phase transitions. We will revise the abstract and §4.3 to present this more explicitly as an inference rather than a direct measurement, and we will note that targeted scaling experiments would be a valuable direction for follow-up work. revision: partial
Referee: [§3 (multilingual experiments)] The multilingual experiments (§3) report that early learning is 'nearly entirely compartmentalized' for small models, yet the manuscript does not state the model sizes, training data volumes, or statistical tests used to support the 'nearly entirely' quantification. These details are required to assess whether the observed separation is robust or an artifact of the particular training regime.

Authors: We apologize for the omission. The experiments used 125M- and 350M-parameter models trained on roughly 10B tokens of balanced multilingual data. The 'nearly entirely' claim is supported by cross-lingual transfer accuracy being less than 5% of within-language accuracy in early checkpoints, with significance assessed by bootstrap resampling (p < 0.001). We will insert these details into §3 and the methods section. revision: yes

Circularity Check

0 steps flagged

Empirical observations of compartmentalization show no circular derivation

full rationale

The paper reports experimental results on LLM behavior across distinct presentations of concepts (e.g., multilingual facts or multi-language code), measuring downstream accuracy, sample efficiency, and intervention phase transitions. These quantities are directly observed rather than derived from parameters fitted to the same target metrics or defined in terms of the claimed internal representations. No equations, uniqueness theorems, or self-citations are invoked to force the central claim; the findings rest on behavioral tests that remain falsifiable by alternative explanations such as optimization dynamics. The work is therefore self-contained against external benchmarks and exhibits no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that different surface presentations of a concept should be unified internally by the language modeling objective; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption Distinct presentations of the same latent concept (e.g., English and Swahili facts) ought to be identified and share statistical strength rather than learned as independent parallel representations.
This premise is required for compartmentalization to be defined as a failure; it is invoked when the abstract states that models 'fail to identify and share statistical strength between distinct presentations of unified concepts.'

pith-pipeline@v0.9.0 · 5689 in / 1274 out tokens · 44032 ms · 2026-05-20T06:08:51.424351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[2]

and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi \'e gas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey

Johnson, Melvin and Schuster, Mike and Le, Quoc V. and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi \'e gas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey. G oogle ' s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Co...

work page doi:10.1162/tacl_a_00065 2017
[3]

2022 , eprint=

Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

work page 2022
[4]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

work page
[5]

GitHub repository , howpublished =

Andrej Karpathy , title =. GitHub repository , howpublished =. 2022 , publisher =

work page 2022
[6]

OpenAI blog , volume=

Language Models are Unsupervised Multitask Learners , author=. OpenAI blog , volume=

work page
[7]

2024 , eprint=

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction , author=. 2024 , eprint=

work page 2024
[8]

2024 , eprint=

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs , author=. 2024 , eprint=

work page 2024
[9]

2026 , eprint=

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality , author=. 2026 , eprint=

work page 2026
[10]

A is B" fail to learn

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" , author=. 2024 , eprint=

work page 2024
[11]

doi:10.48550/arXiv.2502.21228 , url =

Goldman, Omer and Shaham, Uri and Malkin, Dan and Eiger, Sivan and Hassidim, Avinatan and Matias, Yossi and Maynez, Joshua and Gilady, Adi Mayrav and Riesa, Jason and Rijhwani, Shruti and Rimell, Laura and Szpektor, Idan and Tsarfaty, Reut and Eyal, Matan , year =. doi:10.48550/arXiv.2502.21228 , url =. 2502.21228 , archivePrefix=

work page doi:10.48550/arxiv.2502.21228
[12]

2019 , eprint=

Representation Learning with Contrastive Predictive Coding , author=. 2019 , eprint=

work page 2019
[13]

2023 , eprint=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

work page 2023
[14]

2024 , eprint=

The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments , author=. 2024 , eprint=

work page 2024

[1] [1]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[2] [2]

and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi \'e gas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey

Johnson, Melvin and Schuster, Mike and Le, Quoc V. and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi \'e gas, Fernanda and Wattenberg, Martin and Corrado, Greg and Hughes, Macduff and Dean, Jeffrey. G oogle ' s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Co...

work page doi:10.1162/tacl_a_00065 2017

[3] [3]

2022 , eprint=

Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

work page 2022

[4] [4]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =

Penedo, Guilherme and Kydl\'. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , url =. Advances in Neural Information Processing Systems , doi =

work page

[5] [5]

GitHub repository , howpublished =

Andrej Karpathy , title =. GitHub repository , howpublished =. 2022 , publisher =

work page 2022

[6] [6]

OpenAI blog , volume=

Language Models are Unsupervised Multitask Learners , author=. OpenAI blog , volume=

work page

[7] [7]

2024 , eprint=

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction , author=. 2024 , eprint=

work page 2024

[8] [8]

2024 , eprint=

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs , author=. 2024 , eprint=

work page 2024

[9] [9]

2026 , eprint=

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality , author=. 2026 , eprint=

work page 2026

[10] [10]

A is B" fail to learn

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" , author=. 2024 , eprint=

work page 2024

[11] [11]

doi:10.48550/arXiv.2502.21228 , url =

Goldman, Omer and Shaham, Uri and Malkin, Dan and Eiger, Sivan and Hassidim, Avinatan and Matias, Yossi and Maynez, Joshua and Gilady, Adi Mayrav and Riesa, Jason and Rijhwani, Shruti and Rimell, Laura and Szpektor, Idan and Tsarfaty, Reut and Eyal, Matan , year =. doi:10.48550/arXiv.2502.21228 , url =. 2502.21228 , archivePrefix=

work page doi:10.48550/arxiv.2502.21228

[12] [12]

2019 , eprint=

Representation Learning with Contrastive Predictive Coding , author=. 2019 , eprint=

work page 2019

[13] [13]

2023 , eprint=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

work page 2023

[14] [14]

2024 , eprint=

The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments , author=. 2024 , eprint=

work page 2024