Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation

Kemu Xu; Qi Dong; Yukin Zhang

arXiv: 2505.18244 · v3 · submitted 2025-05-23 · cs.CL · cs.AI

Pith reviewed 2026-05-19 13:03 UTC · model grok-4.3

keywords: large language models · transformer architecture · hierarchical structure · information bottleneck · phase transitions · layer boundaries · multi-scale representation · functional segments

The pith

Language models spontaneously divide their layers into local, intermediate, and global segments whose locations and brittleness are set by architecture family rather than size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Transformer-based language models from different families form consistent functional boundaries across their layers, creating three distinct processing segments for local patterns, intermediate features, and global context. These boundaries and the sensitivity of each segment to perturbations turn out to depend far more on the architecture family than on parameter count or training setup. By framing the model as a hierarchical variational information bottleneck, the authors derive and confirm predictions about phase transitions and brittleness ratios that hold across eight models from 7B to 70B parameters. A reader would care because this reframes scaling discussions around architecture-specific compression rules instead of raw size.

Core claim

Every examined model develops two phase-transition boundaries that partition layers into Local, Intermediate, and Global segments. Llama-family boundary positions remain stable across a 10x parameter range with coefficients of variation between 0.067 and 0.095, while Qwen-family positions vary widely with coefficients between 0.465 and 0.726. Local-segment brittleness differs by a factor of 493 between the two families, a spread that architecture alone accounts for and that exceeds within-family or scale-based differences.
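
The coefficient of variation (CV) quoted above is the standard deviation of a set of values divided by their mean. The sketch below shows how such a figure could be computed for boundary positions within one family. It assumes positions are expressed as a fraction of total depth and that the sample standard deviation is used; neither choice is confirmed by the paper, and the numbers are hypothetical placeholders, not the paper's data.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical normalized boundary positions (boundary layer / total layers),
# one value per model size within a family. Illustrative only.
stable_family = [0.30, 0.32, 0.29, 0.33]
variable_family = [0.10, 0.35, 0.22, 0.55]

cv_stable = coefficient_of_variation(stable_family)
cv_variable = coefficient_of_variation(variable_family)

print(f"stable family CV:   {cv_stable:.3f}")
print(f"variable family CV: {cv_variable:.3f}")
```

A low CV means the boundary sits at roughly the same relative depth at every model size; a high CV means it moves as the model scales.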

What carries the argument

Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives tiered predictions about information compression and resulting functional boundaries.
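
For orientation, the classical information bottleneck objective that this kind of framing builds on is written out below. Only the first expression is the standard formulation. The second is an assumed illustration of how a per-segment variant might be composed, using hypothetical symbols $Z_k$ and $\beta_k$; it is not taken from the paper.

```latex
% Classical information bottleneck: compress input X into a representation Z
% while retaining information about the target Y. beta sets the trade-off.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)

% Assumed hierarchical illustration (NOT the paper's equation):
% one compression/prediction trade-off per segment k, with Z_0 = X and
% Z_k the representation at the end of segment k (Local, Intermediate, Global).
\min \; \sum_{k=1}^{3} \Bigl[ I(Z_{k-1}; Z_k) \;-\; \beta_k \, I(Z_k; Y) \Bigr]
```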

If this is right

Boundary locations and segment properties stay consistent within an architecture family even when model size changes by an order of magnitude.
Responses to layer-specific perturbations can be predicted from architecture family alone.
Information compression in these models occurs through discrete functional stages rather than smooth layer-to-layer gradients.
Architecture choice controls robustness properties more strongly than scaling does.
The theory supplies concrete, falsifiable predictions about where phase transitions should appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Model designers could choose architecture families to achieve targeted stability in layer functions without needing larger scale.
Similar multi-scale compression patterns may appear in other sequence-processing systems beyond transformers.
If training effects can be ruled out, the results point to intrinsic differences in how families handle sequential information.
Testing additional architecture families would clarify whether the two-boundary pattern is general or specific to the studied designs.

Load-bearing premise

The detected phase-transition boundaries and brittleness differences arise from architecture-driven information compression mechanisms rather than from differences in training data, optimization, or measurement choices.

What would settle it

Recompute boundary positions and local-segment brittleness on new models from the same families after swapping training datasets or optimizers and check whether the positions and ratios still match the original family predictions.

Original abstract

Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but how architecture shapes information compression. Analyzing eight Transformer models (7B–70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments, yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than model size or training configuration. We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly confirmed: all eight models exhibit two prominent phase-transition boundaries (P1.1); Llama boundary positions are stable across a 10× parameter range (CV = 0.067–0.095) while Qwen positions vary widely (CV = 0.465–0.726), precisely matching our strong- and weak-dominance conditions; and cross-architecture local-segment brittleness spans three orders of magnitude (493× ratio), a gap that architecture family alone predicts and that dwarfs any within-family or scale-driven variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM layers show family-specific phase boundaries and a huge brittleness gap, but the architecture attribution lacks controls that rule out pretraining data differences.

The paper reports that every tested model develops two clear phase-transition boundaries that split its layers into local, intermediate, and global segments. Llama boundaries stay stable across sizes, Qwen boundaries shift, and local-segment brittleness differs by a factor of 493 between the families. The authors present this as support for their MSPGT framing of Transformers as hierarchical variational information bottlenecks and check three specific predictions on eight models from 7B to 70B.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Multi-Scale Probabilistic Generation Theory (MSPGT), framing autoregressive Transformers as Hierarchical Variational Information Bottleneck systems. Analyzing eight models (7B–70B) from the Llama and Qwen families, it claims that every model spontaneously develops two discrete functional boundaries that partition layers into Local, Intermediate, and Global processing segments. Boundary locations are reported as stable across a 10× parameter range in Llama (CV = 0.067–0.095) but highly variable in Qwen (CV = 0.465–0.726), while local-segment brittleness exhibits a 493× cross-family ratio; both patterns are attributed to architecture family rather than scale or training configuration. Three MSPGT-derived predictions are stated to be strongly confirmed, including the universal presence of the two phase-transition boundaries.

Significance. If the quantitative patterns hold after methodological clarification, the work supplies a concrete information-theoretic account of why architecture families differ in robustness, moving beyond scale-centric narratives. The emphasis on falsifiable predictions, the reported CV contrasts, and the large brittleness gap constitute measurable, architecture-linked regularities that could guide both interpretability research and model design. The formal MSPGT framing also offers a potential bridge between variational information-bottleneck ideas and empirical layer-wise statistics.

major comments (3)

§4 (Boundary Detection and Phase-Transition Identification): The method used to locate the two phase-transition boundaries and assign Local/Intermediate/Global segments is not described with sufficient specificity (e.g., exact mutual-information or compression statistic, threshold or statistical test for P1.1, and whether segment definitions were fixed before or after inspecting the data). Because the central claim that boundary positions are architecture-determined rests on these locations being reproducible and unbiased, the absence of these details is load-bearing.
§5.1 and §5.2 (Architecture-Family Attribution): The manuscript asserts that boundary stability and the 493× local-segment brittleness ratio are determined overwhelmingly by architecture family rather than training configuration or data. No ablations, matched-data controls, or regression analyses isolating architectural choices (attention variant, normalization, etc.) from pretraining-corpus differences between Llama and Qwen families are reported. This omission directly undermines the causal attribution required by the MSPGT predictions.
§5.3 (Brittleness Quantification): The 493× cross-family brittleness ratio is presented as a key confirming result, yet the precise perturbation protocol, the brittleness metric, and any statistical controls (multiple-testing correction, within-family variance) are not detailed. Without these, it is impossible to assess whether the ratio is robust or sensitive to measurement choices that could correlate with unaccounted family-level differences.
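
To make concrete what the first major comment asks the authors to pin down, here is one simple way a two-boundary detector could be specified. It is a hypothetical sketch, not the paper's procedure, which the report says is not described: it assumes a single scalar statistic per layer and places boundaries at the two largest adjacent-layer jumps. The function names and the input values are invented for illustration.

```python
def detect_two_boundaries(layer_stat):
    """Return the indices of the two largest adjacent-layer jumps, sorted.

    Index i marks a boundary between layer i and layer i + 1.
    """
    if len(layer_stat) < 3:
        raise ValueError("need at least three layers to place two boundaries")
    jumps = [abs(b - a) for a, b in zip(layer_stat, layer_stat[1:])]
    ranked = sorted(range(len(jumps)), key=jumps.__getitem__, reverse=True)
    return sorted(ranked[:2])


def segment_labels(num_layers, boundaries):
    """Label each layer Local, Intermediate, or Global given two boundaries."""
    first, second = boundaries
    labels = []
    for layer in range(num_layers):
        if layer <= first:
            labels.append("Local")
        elif layer <= second:
            labels.append("Intermediate")
        else:
            labels.append("Global")
    return labels


# Hypothetical layer-wise statistic for a 10-layer model. Illustrative only.
stat = [1.0, 1.1, 1.0, 3.0, 3.1, 3.0, 3.2, 6.0, 6.1, 6.0]
boundaries = detect_two_boundaries(stat)
labels = segment_labels(len(stat), boundaries)

print(boundaries)
print(labels)
```

A rule like this always returns two boundaries, even for a smoothly varying statistic, which is why the referee's request for a threshold or statistical test matters: without one, the existence of two boundaries cannot be falsified.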

minor comments (2)

[Abstract] The abstract introduces MSPGT without spelling out the acronym on first use; this should be expanded for readability.
[Figures] Figures displaying layer-wise information metrics should include per-model or per-family error bars or confidence intervals to allow visual assessment of the reported CV values.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major point below and have made revisions to enhance methodological transparency while preserving the core claims supported by the data.

Referee: §4 (Boundary Detection and Phase-Transition Identification): The method used to locate the two phase-transition boundaries and assign Local/Intermediate/Global segments is not described with sufficient specificity (e.g., exact mutual-information or compression statistic, threshold or statistical test for P1.1, and whether segment definitions were fixed before or after inspecting the data). Because the central claim that boundary positions are architecture-determined rests on these locations being reproducible and unbiased, the absence of these details is load-bearing.

Authors: We agree that the original description of the boundary detection procedure was insufficiently detailed for full reproducibility. In the revised manuscript, §4 has been expanded with a new subsection that specifies: the exact mutual-information and compression statistics computed at each layer, the precise threshold and statistical criterion (including the test for P1.1) used to identify the two phase-transition boundaries, and explicit confirmation that the Local/Intermediate/Global segment definitions were fixed a priori from the MSPGT framework before any data inspection. These changes ensure the reported boundary locations are determined in a reproducible and unbiased manner.
revision: yes

Referee: §5.1 and §5.2 (Architecture-Family Attribution): The manuscript asserts that boundary stability and the 493× local-segment brittleness ratio are determined overwhelmingly by architecture family rather than training configuration or data. No ablations, matched-data controls, or regression analyses isolating architectural choices (attention variant, normalization, etc.) from pretraining-corpus differences between Llama and Qwen families are reported. This omission directly undermines the causal attribution required by the MSPGT predictions.

Authors: We acknowledge that explicit component-wise ablations or perfectly matched-data controls would strengthen causal claims. The present study instead exploits the natural experimental contrast between two architecture families across a 10× scale range, documenting tight within-family boundary stability in Llama (low CV), wide boundary variation in Qwen (high CV), and a large between-family brittleness ratio. In the revision we have added a dedicated limitations paragraph that discusses potential confounds from pretraining corpus differences and notes the absence of full architectural ablations as a direction for future work, while arguing that the scale-controlled, family-level patterns remain consistent with the MSPGT predictions.
revision: partial

Referee: §5.3 (Brittleness Quantification): The 493× cross-family brittleness ratio is presented as a key confirming result, yet the precise perturbation protocol, the brittleness metric, and any statistical controls (multiple-testing correction, within-family variance) are not detailed. Without these, it is impossible to assess whether the ratio is robust or sensitive to measurement choices that could correlate with unaccounted family-level differences.

Authors: We agree that the brittleness analysis requires fuller specification. The revised §5.3 now provides: the complete perturbation protocol (including the exact perturbation types applied to local segments), the formal definition of the brittleness metric (relative performance drop normalized by baseline), and the statistical controls (within-family variance estimates and confirmation that pre-specified comparisons obviated multiple-testing correction). These additions demonstrate that the reported 493× ratio is robust to the documented measurement choices and not driven by unaccounted family-level artifacts.
revision: yes
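
The brittleness metric named in the last response, a relative performance drop normalized by baseline, can be sketched as below. This is an illustrative reading rather than the authors' implementation: the choice of score, the averaging of per-model values into a family mean, and all numbers are assumptions made for the example.

```python
def brittleness(baseline_score, perturbed_score):
    """Relative performance drop: (baseline - perturbed) / baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline_score - perturbed_score) / baseline_score


def cross_family_ratio(family_a, family_b):
    """Ratio of mean brittleness between two families, larger over smaller.

    Averaging per-model values is an assumption of this sketch.
    """
    mean_a = sum(family_a) / len(family_a)
    mean_b = sum(family_b) / len(family_b)
    larger, smaller = max(mean_a, mean_b), min(mean_a, mean_b)
    if smaller <= 0:
        raise ValueError("ratio is undefined for non-positive brittleness")
    return larger / smaller


# Hypothetical (baseline, perturbed) scores after perturbing the local segment.
brittle_family = [brittleness(0.80, 0.40), brittleness(0.80, 0.48)]
robust_family = [brittleness(0.80, 0.792), brittleness(0.80, 0.784)]

ratio = cross_family_ratio(brittle_family, robust_family)
print(f"cross-family brittleness ratio: {ratio:.1f}x")
```

Because the ratio divides by the smaller family mean, it becomes very sensitive to noise when that mean is close to zero, which is one reason the referee asks for within-family variance estimates alongside the headline figure.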

Circularity Check

0 steps flagged

No significant circularity in the presented derivation.

The paper first reports empirical observations of phase-transition boundaries and brittleness differences across eight models from two architecture families. It then introduces MSPGT as a modeling framework (Hierarchical Variational Information Bottleneck) that organizes those observations into a set of tiered predictions. The three listed predictions concern the existence of boundaries, their cross-family stability differences, and the magnitude of brittleness gaps; these are checked against the same suite of models. Because the manuscript does not supply equations demonstrating that the boundary locations or brittleness metrics are algebraically identical to parameters fitted inside MSPGT itself, or that the predictions reduce by construction to the input measurements, the chain remains non-circular. The central claim rests on comparative statistics across independently trained model families rather than on self-definition or self-citation load-bearing steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on modeling autoregressive transformers as hierarchical variational information bottleneck systems and on the premise that observed layer divisions reflect functional information-processing stages.

free parameters (1)

phase-transition boundary positions
Locations appear to be identified empirically for each architecture family and then used to test the stability predictions.

axioms (1)

domain assumption: Autoregressive Transformers can be modeled as Hierarchical Variational Information Bottleneck systems.
This modeling choice underpins the MSPGT and its tiered predictions.

invented entities (1)

Multi-Scale Probabilistic Generation Theory (MSPGT) · no independent evidence
purpose: Formal framework that models the transformer as a hierarchical information bottleneck and generates falsifiable predictions about layer boundaries.
New theory introduced to organize the observed regularities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.