arxiv: 2604.04469 · v1 · submitted 2026-04-06 · 💻 cs.CL · q-bio.QM

Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability

Pith reviewed 2026-05-10 20:08 UTC · model grok-4.3

keywords: scalar variability, transformer representations, magnitude representations, numerical cognition, hidden state dispersion, anti-scalar pattern, distributional learning

The pith

Transformer language models show decreasing representational variability with larger numerical magnitudes, the opposite of scalar variability in biological systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether transformer models display scalar variability, the biological pattern in which representational noise grows proportionally with magnitude to maintain a constant coefficient of variation. By measuring the spread of hidden-state vectors for the same number across many different sentences, the authors instead observe that variability shrinks as magnitudes increase. This matters because the models already reproduce the log-compressive geometry of biological magnitude systems, yet fail to produce the matching noise signature. The finding indicates that statistical patterns in text alone are not enough to generate the full structure of human-like quantity representations.

Core claim

Analysis of hidden-state dispersion across carrier sentences for 26 numerical magnitudes in three large transformer models revealed a negative scaling exponent of approximately -0.19 for variability along the magnitude axis, with no layers showing positive scaling. This anti-scalar pattern held in full-dimensional space and after sentence-identity correction, was three to five times stronger on the magnitude axis than on orthogonal dimensions, and correlated strongly with corpus frequency of the numbers. The models thus share the log-compressive geometry of biological magnitude systems but lack their constant coefficient of variation in noise.
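
As a rough illustration of what a scaling exponent means here, the sketch below fits the slope of log-variability against log-magnitude. It assumes the exponent is defined through V ∝ n^α and estimated by ordinary least squares; the function name `scaling_exponent` and all values are hypothetical and synthetic, not the paper's data or code.

```python
import numpy as np

def scaling_exponent(magnitudes, variability):
    """Slope of log(variability) regressed on log(magnitude).

    Assumed definition: variability ~ k * magnitude**alpha.
    """
    log_n = np.log(np.asarray(magnitudes, dtype=float))
    log_v = np.log(np.asarray(variability, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_v, deg=1)
    return float(slope)

# Synthetic illustration only (not the paper's 26 magnitudes).
n = np.array([1, 2, 5, 10, 20, 50, 100, 1000], dtype=float)
scalar_like = 0.3 * n            # noise grows in proportion to magnitude
anti_scalar = 2.0 * n ** -0.19   # noise shrinks as magnitude grows

alpha_scalar = scaling_exponent(n, scalar_like)
alpha_anti = scaling_exponent(n, anti_scalar)
print(round(alpha_scalar, 3), round(alpha_anti, 3))
```

Under this definition, a positive slope near 1 corresponds to the scalar pattern, while a negative slope corresponds to the anti-scalar pattern described above.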

What carries the argument

Dispersion of hidden-state vectors across carrier sentences, used as a proxy for representational noise in numerical magnitude encodings.
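
A minimal sketch of one possible dispersion measure follows. This page does not state the paper's exact formula, so the mean Euclidean distance to the centroid used here is an assumption, and `euclidean_dispersion` is a hypothetical name; the random arrays stand in for hidden states of one number embedded in many carrier sentences.

```python
import numpy as np

def euclidean_dispersion(hidden_states):
    """Mean Euclidean distance of each vector from the centroid.

    hidden_states: array of shape (n_sentences, hidden_dim), one row per
    carrier sentence containing the same number. (Assumed measure.)
    """
    h = np.asarray(hidden_states, dtype=float)
    centroid = h.mean(axis=0)
    return float(np.linalg.norm(h - centroid, axis=1).mean())

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 40 carrier sentences, 16-dimensional states.
tight = rng.normal(scale=0.5, size=(40, 16))   # low cross-sentence spread
loose = rng.normal(scale=2.0, size=(40, 16))   # high cross-sentence spread

d_tight = euclidean_dispersion(tight)
d_loose = euclidean_dispersion(loose)
print(d_tight < d_loose)
```

Computing one such value per magnitude gives the per-magnitude variability that a scaling exponent would then be fitted to.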

If this is right

  • Distributional learning from language data alone cannot produce scalar variability.
  • Transformer magnitude representations exhibit log-compressive geometry without the matching noise profile.
  • Per-magnitude variability is strongly predicted by how frequently the number appears in the training corpus.
  • The decrease in variability is specific to and amplified along the magnitude axis compared to orthogonal dimensions.
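
The frequency relationship in the list above is reported elsewhere on this page as rho = .84. Assuming that "rho" denotes Spearman's rank correlation (the page does not say), a small NumPy-only sketch of the statistic looks like this; the function names and the five data points are hypothetical.

```python
import numpy as np

def rank_with_ties(values):
    """Ranks starting at 1, with tied values given their average rank."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(a, b):
    """Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank_with_ties(a), rank_with_ties(b))[0, 1])

# Hypothetical values: corpus frequency and dispersion for five numbers.
frequency = [1000, 500, 200, 50, 10]
dispersion = [0.9, 0.7, 0.75, 0.3, 0.2]
rho = spearman_rho(frequency, dispersion)
print(round(rho, 3))  # 0.9
```

A rank-based statistic only asks whether more frequent numbers tend to have larger dispersion, without assuming a linear relationship.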

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Additional training objectives or data sources beyond text may be required to induce scalar variability in artificial systems.
  • The absence of this noise pattern could contribute to models' difficulties in tasks requiring intuitive numerical estimation.
  • Examining variability in other modalities or hybrid models might reveal conditions under which scalar variability emerges.

Load-bearing premise

Dispersion of hidden-state vectors across different carrier sentences serves as a comparable measure to the scalar variability observed in biological experiments.

What would settle it

Finding a positive scaling exponent where representational dispersion increases with magnitude, or a constant coefficient of variation across magnitudes in the model hidden states.
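
The two criteria can be related by a short derivation. It assumes that dispersion σ follows a power law in magnitude n and is expressed on the same scale as n, which is a simplification of the hidden-state setting; the symbols k and σ are introduced here for illustration only.

```latex
\begin{align*}
\sigma(n) &= k\,n^{\alpha} \\
\mathrm{CV}(n) &= \frac{\sigma(n)}{n} = k\,n^{\alpha - 1} \\
\alpha = 1 &\;\Rightarrow\; \mathrm{CV}(n) = k \quad \text{(constant)} \\
\alpha \approx -0.19 &\;\Rightarrow\; \mathrm{CV}(n) \propto n^{-1.19} \quad \text{(falls with } n\text{)}
\end{align*}
```

Under these assumptions a constant coefficient of variation corresponds to α = 1, which matches the scalar prediction marked in the Figure 5 caption.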

Figures

Figures reproduced from arXiv: 2604.04469 by Jon-Paul Cacioli.

Figure 1. Representational variability as a function of numerical magnitude (log–log axes) at layer …
Figure 2. Corpus frequency predicts representational variability. Each point is one of 26 numerical …
Figure 3. E4: On-axis (PC1, magnitude direction) vs off-axis scaling exponent across primary …
Figure 4. E6: Instruction tuning amplifies the anti-scalar pattern. Left: layerwise …
Figure 5. Scaling exponent α across all layers for all three models. Left: V_eucl (raw). Centre: V_residual (sentence-corrected). Right: V_proj (magnitude axis). Dashed grey line: scalar prediction (α = 1). Dotted grey line: α = 0. All models show α < 0 at all primary layers across all measures.
Original abstract

Scalar variability -- the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation -- is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent alpha approx -0.19; 0/16 primary layers with alpha > 0, all three models). The negative sign was consistent in full-dimensional space (alpha approx -0.04) and after sentence-identity correction (alpha approx -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (rho = .84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that transformer language models exhibit anti-scalar variability in magnitude representations, opposite to biological systems' constant coefficient of variation. Analyzing dispersion of hidden-state vectors for 26 numerical magnitudes across carrier sentences in Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3-8B-Base, it reports a negative scaling exponent (alpha ≈ -0.19 along the magnitude axis; 0/16 primary layers positive), consistent in full space (alpha ≈ -0.04) and after sentence correction (alpha ≈ -0.007). The effect is 3-5x stronger on the magnitude axis than orthogonal dimensions, with strong correlation to corpus frequency (rho = .84), leading to the conclusion that distributional learning alone cannot produce scalar variability.

Significance. If the dispersion measure validly isolates representational noise, the finding would demonstrate that transformers capture log-compressive magnitude geometry but lack the proportional noise scaling of biological systems, implying that statistical learning from text is insufficient for this cognitive signature. The consistency of the negative alpha across three models, full-dimensional space, and corrected conditions is a strength of the empirical design.

major comments (3)
  1. [Abstract] Abstract: The use of cross-sentence dispersion as a proxy for representational noise (comparable to biological constant-CV) is load-bearing for the claim that 'distributional learning alone is insufficient,' yet the reported rho = .84 correlation with corpus frequency indicates sensitivity to usage statistics; the sentence-identity correction attenuates alpha only to ≈ -0.007 without eliminating the negative sign or providing frequency-matched controls.
  2. [Abstract] Abstract and implied Results: The scaling exponent alpha ≈ -0.19 and the claim of consistency (0/16 layers with alpha > 0) are presented without error bars, confidence intervals, or statistical tests, making it impossible to evaluate whether the anti-scalar pattern is robust or driven by a subset of magnitudes/layers.
  3. [Abstract] Abstract: Details on carrier-sentence selection and magnitude-axis identification are absent; without these, it is unclear whether the 3-5x stronger effect along the magnitude axis versus orthogonal dimensions reflects intrinsic geometry or sentence-construction artifacts.
minor comments (1)
  1. Clarify the precise procedure for extracting the magnitude axis from hidden states and for computing per-magnitude dispersion (e.g., Euclidean distance, cosine, or variance along principal components).
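
One common way to obtain the confidence intervals requested in major comment 2 is a percentile bootstrap. The sketch below resamples magnitudes only and refits the log-log slope each time; it is an illustration under stated assumptions, not the authors' procedure. The names `fit_alpha` and `bootstrap_alpha_ci`, the log-spaced magnitudes, and the noise level are all hypothetical.

```python
import numpy as np

def fit_alpha(n, v):
    """Log-log least-squares slope (assumed definition of alpha)."""
    return float(np.polyfit(np.log(n), np.log(v), 1)[0])

def bootstrap_alpha_ci(n, v, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for alpha, resampling magnitudes with replacement."""
    rng = np.random.default_rng(seed)
    n = np.asarray(n, dtype=float)
    v = np.asarray(v, dtype=float)
    alphas = []
    while len(alphas) < n_resamples:
        idx = rng.integers(0, len(n), size=len(n))
        if np.unique(n[idx]).size < 2:
            continue  # slope is undefined with a single distinct magnitude
        alphas.append(fit_alpha(n[idx], v[idx]))
    lo, hi = np.percentile(alphas, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic data: 26 hypothetical magnitudes with a true exponent of -0.19.
rng = np.random.default_rng(1)
n = np.logspace(0, 4, 26)
v = 2.0 * n ** -0.19 * np.exp(rng.normal(scale=0.05, size=n.size))

lo, hi = bootstrap_alpha_ci(n, v, n_resamples=1000)
print(lo < 0 and hi < 0)
```

An interval that excludes zero would indicate that the negative sign is not driven by a handful of magnitudes; resampling sentences as well would require the per-sentence hidden states, which this sketch does not model.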

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have prompted us to strengthen the statistical reporting, expand methodological details, and add controls for frequency effects. We address each major comment below and indicate the corresponding revisions.

  1. Referee: The use of cross-sentence dispersion as a proxy for representational noise (comparable to biological constant-CV) is load-bearing for the claim that 'distributional learning alone is insufficient,' yet the reported rho = .84 correlation with corpus frequency indicates sensitivity to usage statistics; the sentence-identity correction attenuates alpha only to ≈ -0.007 without eliminating the negative sign or providing frequency-matched controls.

    Authors: We agree that the rho = .84 correlation with corpus frequency indicates that representational variability is shaped by usage statistics, which is consistent with distributional learning. The sentence-identity correction attenuates but does not reverse the negative sign of alpha, which we interpret as evidence that the anti-scalar pattern is not solely an artifact of sentence identity. We acknowledge that frequency-matched controls would provide a stronger test of whether the effect exceeds what frequency alone predicts. In the revised manuscript we add a supplementary analysis that bins magnitudes by frequency and recomputes alpha within bins; the negative scaling remains within each bin, supporting the claim that distributional learning from text does not produce the constant-CV signature of biological systems. revision: partial

  2. Referee: The scaling exponent alpha ≈ -0.19 and the claim of consistency (0/16 layers with alpha > 0) are presented without error bars, confidence intervals, or statistical tests, making it impossible to evaluate whether the anti-scalar pattern is robust or driven by a subset of magnitudes/layers.

    Authors: We accept that the absence of uncertainty estimates and formal tests limits evaluation of robustness. In the revised manuscript we report bootstrap 95% confidence intervals for every scaling exponent (computed over 5,000 resamples of magnitudes and sentences) and include a permutation test (10,000 iterations) that evaluates the probability of observing no layers with positive alpha under a null of no systematic scaling. These additions confirm that the reported alpha values and the 0/16 count are statistically reliable. revision: yes

  3. Referee: Details on carrier-sentence selection and magnitude-axis identification are absent; without these, it is unclear whether the 3-5x stronger effect along the magnitude axis versus orthogonal dimensions reflects intrinsic geometry or sentence-construction artifacts.

    Authors: We regret the insufficient detail in the abstract. The Methods section specifies that carrier sentences were generated from a small set of neutral syntactic templates (e.g., 'The number is X', 'There are X items') with lexical fillers drawn from a fixed vocabulary to reduce semantic bias, and the magnitude axis is defined as the first principal component of the concatenated hidden-state matrix across all 26 magnitudes and sentences. We have expanded the Methods with explicit template lists, pseudocode for axis extraction, and a supplementary figure showing the variance explained by the first PC versus subsequent dimensions. The 3-5x comparison is obtained by contrasting the scaling exponent on this PC against the mean exponent across all orthogonal PCs; the revised text now includes the exact computation and replication code. revision: yes
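
The axis definition in the last response (first principal component of the stacked hidden states) can be sketched as follows. Mean-centering before the decomposition, the use of sample standard deviation along the axis, and the names `magnitude_axis` and `projected_sd` are assumptions made for illustration; the data are synthetic, with magnitude deliberately encoded along one coordinate.

```python
import numpy as np

def magnitude_axis(states):
    """Unit vector of the first principal component of the stacked states."""
    x = np.asarray(states, dtype=float)
    centered = x - x.mean(axis=0)
    _u, _s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def projected_sd(states, labels, axis):
    """Sample SD of the projection onto `axis`, separately per magnitude."""
    proj = np.asarray(states, dtype=float) @ axis
    labels = np.asarray(labels)
    return {m.item(): float(proj[labels == m].std(ddof=1)) for m in np.unique(labels)}

# Synthetic states: 4 hypothetical magnitudes, 30 sentences each, 8 dimensions.
rng = np.random.default_rng(0)
rows, labels = [], []
for m in [1, 10, 100, 1000]:
    block = rng.normal(scale=0.1, size=(30, 8))        # off-axis noise
    on_axis_noise = rng.normal(scale=0.5 * float(m) ** -0.19, size=30)
    block[:, 0] = 5.0 * np.log10(m) + on_axis_noise    # log-compressive mean
    rows.append(block)
    labels += [m] * 30
states = np.vstack(rows)
labels = np.array(labels)

axis = magnitude_axis(states)
sd_by_mag = projected_sd(states, labels, axis)
print(sd_by_mag[1] > sd_by_mag[1000])
```

Repeating the projection for the remaining principal components and fitting an exponent on each would give the on-axis versus off-axis contrast that the response describes.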

Circularity Check

0 steps flagged

No circularity: purely empirical measurement of representation dispersion

The paper reports direct empirical measurements of hidden-state vector dispersion across carrier sentences for 26 numerical magnitudes in three transformer models. It computes scaling exponents (alpha ≈ -0.19 along the magnitude axis) and correlations (rho = .84 with corpus frequency) from the observed data. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the central claims are statistical summaries of the input representations. The single self-citation to Cacioli 2026 supplies the dataset but does not bear the load of any derivation or uniqueness claim. The analysis is self-contained against external benchmarks and contains no self-definitional, fitted-input, or ansatz-smuggling steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the work is purely empirical measurement of existing model representations.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. Cacioli, J.-P. (2026). Weber’s law in transformer magnitude representations: Efficient coding, representational geometry, and psychophysical laws in language models. arXiv preprint arXiv:2603.20642.
  2. Gallistel, C. R. and Gelman, R. (2000). Non-verbal numerical cognition: From reals to integers. Trends in Cognitive Sciences, 4(2):59–65.
  3. Ganguli, D. and Simoncelli, E. P. (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Computation, 26(10):2103–2134.
  4. Gibbon, J. (1977). Scalar expectancy theory and Weber’s law in animal timing. Psychological Review, 84(3):279–325.
  5. Meck, W. H. and Church, R. M. (1983). A mode control model of counting and timing processes. Journal of Experimental Psychology: Animal Behavior Processes, 9(3):320–334.
  6. Pardo-Vazquez, J. L. et al. (2019). The mechanistic foundation of Weber’s law. Nature Neuroscience, 22:1493–1502.