Recognition: 2 theorem links
Meaning Abstraction Induces the Brain Alignment of Language and Speech Models
Pith reviewed 2026-05-16 07:35 UTC · model grok-4.3
The pith
Brain alignment arises from meaning abstraction in middle layers of language and speech models, not from next-word prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intermediate hidden states from language and speech models predict measured brain responses because models construct higher-order linguistic features in their middle layers. This construction is cued by a peak in layerwise intrinsic dimension. A layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; the relation between intrinsic dimension and brain predictivity emerges over pre-training; and fine-tuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content. Semantic richness, high intrinsic dimension, and brain predictivity therefore mirror one another.
What carries the argument
Layerwise intrinsic dimension, which measures the complexity of features extracted at each layer and peaks in the middle layers where higher-order linguistic abstractions are formed.
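A minimal sketch of such a layerwise sweep, assuming a GPT-2-style model from Hugging Face transformers as a stand-in and the TwoNN estimator named later in the rebuttal (Facco et al., 2017); the paper's actual models, corpus, and estimator settings are not reproduced here.

```python
# Sketch: layerwise intrinsic dimension (ID) of hidden states.
# gpt2 and the toy sentence are stand-ins; a real analysis needs a
# large corpus so the nearest-neighbor statistics are meaningful.
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer

def two_nn_id(X: np.ndarray) -> float:
    """TwoNN MLE (Facco et al., 2017): the ratio mu = r2/r1 of each
    point's two nearest-neighbor distances is Pareto-distributed
    with exponent equal to the intrinsic dimension."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / np.maximum(dist[:, 1], 1e-12)  # column 0 is the point itself
    mu = mu[mu > 1.0]                                # drop degenerate ties
    return len(mu) / float(np.sum(np.log(mu)))

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()
enc = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).hidden_states  # (n_layers + 1) tensors of [B, T, D]

ids = [two_nn_id(h.reshape(-1, h.shape[-1]).numpy()) for h in hidden]
print("ID peaks at layer", int(np.argmax(ids)))
```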
If this is right
- Layers with higher intrinsic dimension explain more variance in brain signals from fMRI and ECoG recordings (see the sketch after this list).
- The relationship between intrinsic dimension and brain predictivity strengthens as models undergo pre-training.
- Fine-tuning a model to improve brain prediction increases both its intrinsic dimension and the semantic richness of its representations.
- Next-word prediction alone is insufficient to produce strong brain alignment without the accompanying development of high-dimensional semantic features.
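Together these predictions imply a concrete analysis loop: ridge-regress each layer's features onto voxel responses, score by held-out correlation, then correlate the layerwise scores with intrinsic dimension. A minimal sketch with random arrays standing in for features, fMRI data, and ID values (none of it the paper's data or code):

```python
# Sketch: does layerwise ID track layerwise brain predictivity?
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def brain_predictivity(X, Y, n_splits=5):
    """Mean held-out voxelwise correlation for one layer's features."""
    scores = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RidgeCV(alphas=np.logspace(-2, 6, 9)).fit(X[tr], Y[tr])
        pred = model.predict(X[te])
        r = [pearsonr(pred[:, v], Y[te][:, v])[0] for v in range(Y.shape[1])]
        scores.append(np.nanmean(r))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 50))                       # stand-in voxel responses
layer_feats = [rng.standard_normal((500, 768)) for _ in range(12)]
layer_ids = rng.uniform(10, 40, size=12)                 # stand-in ID values

pred = [brain_predictivity(X, Y) for X in layer_feats]
r, p = pearsonr(layer_ids, pred)
print(f"ID vs. brain predictivity across layers: r={r:.2f}, p={p:.3f}")
```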
Where Pith is reading between the lines
- Models trained on other sufficiently complex tasks that force abstraction could produce similar brain alignment without language modeling objectives.
- Architectures that explicitly encourage high intrinsic dimension in middle layers might achieve better brain alignment with less data.
- The same intrinsic-dimension signature could be used to identify which layers to extract for downstream applications that aim to mimic human semantic processing.
Load-bearing premise
That peaks in intrinsic dimension directly index the semantic abstraction relevant to brain processing, and that the observed correlations reflect a causal driver rather than a side effect of model scale or training data.
What would settle it
An experiment that holds semantics fixed while altering intrinsic dimension across layers and checks whether brain predictivity still tracks the dimension measure.
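One hypothetical way to build half of that control is sketched below: perturb a layer's activations so that estimated intrinsic dimension rises while the original coordinates, and hence any linear semantic readout on them, survive untouched. The function name and parameters are illustrative, not from the paper.

```python
# Sketch: an ID-raising, semantics-preserving perturbation.
# Independent noise coordinates inflate nearest-neighbor ID
# estimates, while a probe restricted to the original features
# is unaffected. Illustrative only.
import numpy as np

def lift_id_keep_semantics(X: np.ndarray, n_noise: int = 64,
                           scale: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noise = scale * rng.standard_normal((X.shape[0], n_noise))
    return np.concatenate([X, noise], axis=1)  # X survives in the first block
```

If brain predictivity tracked the dimension measure per se, the lifted features should explain brain signals better; if it tracks semantics, they should not.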
Original abstract
Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their next-word prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension, a measure of feature complexity. We show that a layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and finetuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity is rich meaning abstraction of the inputs, where language modeling is a task sufficiently complex (but perhaps not the only) to require it.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that alignment between intermediate layers of language and speech models with brain responses (fMRI and ECoG) arises from shared meaning abstraction rather than next-word prediction. This is cued by a peak in layerwise intrinsic dimension (ID) as a measure of feature complexity; ID strongly predicts brain predictivity, the ID-brain relation emerges during pre-training, and brain-supervised fine-tuning jointly increases ID and semantic content.
Significance. If the central results hold, the work supplies a mechanistic account linking representational complexity (via ID) to semantic abstraction and brain predictivity. It offers a potential explanation for why middle layers outperform output layers and could guide construction of more brain-aligned models. The joint observation that ID, semantic richness, and brain alignment co-vary is a useful empirical contribution.
major comments (3)
- [Results on pre-training and fine-tuning] The inference that ID indexes semantic abstraction as the causal driver of brain alignment, independent of next-token prediction, rests on correlational evidence alone. All examined models remain next-token predictors; no control representations that achieve high ID without semantic structure (or low ID with semantic structure) are tested. This separation is load-bearing for the central claim.
- [Methods (intrinsic dimension estimation)] The definition and computation of layerwise intrinsic dimension must be shown to be independent of the same representational properties used to predict brain signals. Without an explicit statement that ID estimation uses only geometric properties of the activation manifold and not the semantic labels or brain targets, the risk of circularity remains unaddressed.
- [Fine-tuning experiments] Brain-supervised fine-tuning is reported to increase both ID and semantic content, yet the design does not include a matched control condition (e.g., fine-tuning on a non-semantic task that also raises ID) to test whether the ID increase is sufficient for improved brain predictivity.
minor comments (2)
- [Abstract and Methods] The abstract refers to 'a peak in the layerwise intrinsic dimension' without stating the estimator, neighborhood size, or dimensionality-reduction step used; these parameters should appear in the main text before the first results figure.
- [Figures and Tables] Figure legends and table captions should explicitly report the number of subjects, stimuli, and cross-validation folds for all brain-prediction correlations.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that help clarify the scope of our claims. We address each major point below with clarifications from the manuscript and propose targeted revisions to improve transparency without overstating the evidence.
Point-by-point responses
- Referee: [Results on pre-training and fine-tuning] The inference that ID indexes semantic abstraction as the causal driver of brain alignment, independent of next-token prediction, rests on correlational evidence alone. All examined models remain next-token predictors; no control representations that achieve high ID without semantic structure (or low ID with semantic structure) are tested. This separation is load-bearing for the central claim.
Authors: We agree that the evidence remains correlational and that all models use next-token prediction objectives, so we cannot fully isolate ID from semantics with the current experiments. The pre-training results show the ID-brain relation emerging as semantic features develop, and the fine-tuning results demonstrate joint increases in ID, semantic content, and alignment. In the revised manuscript we will add an explicit limitations paragraph acknowledging the absence of control representations (e.g., high-ID non-semantic embeddings) and will suggest such experiments as future work. We cannot introduce new control models in this revision. (revision: partial)
- Referee: [Methods (intrinsic dimension estimation)] The definition and computation of layerwise intrinsic dimension must be shown to be independent of the same representational properties used to predict brain signals. Without an explicit statement that ID estimation uses only geometric properties of the activation manifold and not the semantic labels or brain targets, the risk of circularity remains unaddressed.
Authors: Intrinsic dimension is estimated via the TwoNN method applied solely to the raw activation vectors of each layer; the estimator uses only Euclidean nearest-neighbor distances within the activation manifold and incorporates no semantic labels, brain targets, or supervised signals. We will insert a new subsection in Methods that states this independence explicitly, reproduces the estimator formula, and confirms that brain predictivity analyses are performed after ID computation on held-out data. (revision: yes)
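For reference, the standard TwoNN estimator that this description matches (Facco et al., 2017), with notation assumed here rather than taken from the manuscript:

```latex
% TwoNN (Facco et al., 2017). For each activation x_i, let
% r_{i,1} <= r_{i,2} be its two nearest-neighbor distances and
% mu_i = r_{i,2} / r_{i,1}. On a manifold of dimension d,
%   P(mu_i <= mu) = 1 - mu^{-d},
% which gives the maximum-likelihood estimate over N points:
\[
  \hat{d} \;=\; \frac{N}{\sum_{i=1}^{N} \ln \mu_i}.
\]
```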
- Referee: [Fine-tuning experiments] Brain-supervised fine-tuning is reported to increase both ID and semantic content, yet the design does not include a matched control condition (e.g., fine-tuning on a non-semantic task that also raises ID) to test whether the ID increase is sufficient for improved brain predictivity.
Authors: We concur that a matched non-semantic fine-tuning control would strengthen causal claims about ID sufficiency. Our current results show that brain-supervised fine-tuning simultaneously elevates ID, semantic richness, and alignment, consistent with the abstraction hypothesis. In revision we will expand the fine-tuning analysis with trajectory plots and add a discussion paragraph noting the missing control condition as a limitation while proposing it for future study. New control experiments cannot be added within the scope of this revision. (revision: partial)
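A sketch of what brain-supervised fine-tuning plausibly looks like, assuming a pooled linear readout from one hidden layer trained onto voxel responses by gradient descent; the layer choice, pooling, and loss here are guesses, not the paper's recipe.

```python
# Sketch: fine-tune a stand-in LM so a linear readout from one
# hidden layer predicts voxel responses; track the layer's ID and
# held-out brain correlation per epoch to get trajectory plots.
import torch
import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
readout = nn.Linear(model.config.hidden_size, 1000)  # 1000 stand-in voxels
opt = torch.optim.AdamW(
    list(model.parameters()) + list(readout.parameters()), lr=1e-5
)

def brain_step(input_ids: torch.Tensor, voxels: torch.Tensor, layer: int = 6):
    """One update: mean-pool tokens at `layer`, regress onto voxels."""
    h = model(input_ids=input_ids).hidden_states[layer]   # [B, T, D]
    loss = nn.functional.mse_loss(readout(h.mean(dim=1)), voxels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```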
Circularity Check
No significant circularity detected in the derivation chain
full rationale
The paper computes layerwise intrinsic dimension directly from the geometry of hidden-state representations and separately measures brain predictivity via linear regression on the same states; the reported correlation, its emergence during pre-training, and the joint increase under brain-supervised fine-tuning are empirical observations rather than definitional reductions. No equation equates ID to brain predictivity by construction, no parameter is fitted on brain data and then relabeled as an independent prediction, and no load-bearing uniqueness claim rests solely on self-citation. The argument that abstraction (indexed by ID) rather than next-token prediction drives alignment is supported by the timing and intervention results, even if those results leave room for alternative interpretations; such evidential gaps do not constitute circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Intrinsic dimension of hidden states measures the complexity of higher-order linguistic features relevant to brain responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
"models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension... semantic richness, high intrinsic dimension, and brain predictivity mirror each other"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · echoes
"the relation between intrinsic dimension and brain predictivity arises over model pre-training... finetuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- echoes: The paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.