pith. machine review for the scientific record.

arxiv: 2605.14025 · v1 · submitted 2026-05-13 · 🧬 q-bio.NC · cs.AI

Recognition: no theorem link

Do Language Models Align with Brains? Prediction Scores Are Not Enough

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:49 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.AI
keywords language models · brain alignment · neural prediction · control analysis · naturalistic datasets · prediction scores · mechanism stripping · reliability ceilings

The pith

Language-model representations fail L-PACT alignment gates once nuisance controls and brain-brain ceilings are applied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether high prediction scores between language models and brain recordings truly indicate that the models capture brain-relevant language computation. It introduces L-PACT, a framework that requires evidence to survive four gates: better-than-baseline prediction, reproduction of brain-to-brain relational patterns, survival after mechanism stripping, and normalization within brain-brain reliability ceilings. When applied to primary naturalistic datasets and derived model representations, every tested model row failed the full set of gates. All 146 integrated decisions were reclassified as control-explained rather than evidence of structural alignment.

Core claim

Across 414 predictive-control rows, 2304 relational profiles, 4320 mechanism-stripping rows, and 420 brain-brain ceiling rows, no real language-model representation passed the predictive, relational, mechanism-stripping, or operational reliability gates; all integrated outcomes were accounted for by nuisance baselines, acoustic-envelope controls, and brain-brain ceilings.

What carries the argument

L-PACT, a source-audited multi-gate framework that evaluates predictive accuracy against baselines, reproduction of brain-to-brain profiles, held-out scores after mechanism stripping, and normalization to brain-brain ceilings.
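The four gates read naturally as a short decision cascade. The sketch below is an editorial reconstruction in Python, not the paper's code: the `GateScores` fields, gate order, and thresholds are illustrative assumptions.

```python
# Hypothetical sketch of an L-PACT integrated decision: a row passes only if
# it clears all four gates; otherwise it is labelled by the control that
# explains it. Thresholds and field names are assumptions, not the paper's.
from dataclasses import dataclass

@dataclass
class GateScores:
    model_r: float           # held-out prediction score for model features
    baseline_r: float        # best nuisance / acoustic-envelope control score
    relational_match: float  # model-brain vs. brain-brain profile similarity
    stripped_r: float        # held-out score after mechanism stripping
    ceiling_r: float         # brain-brain reliability ceiling

def lpact_decision(s: GateScores,
                   rel_threshold: float = 0.5,
                   ceiling_fraction: float = 0.1) -> str:
    """Classify one integrated row: pass all four gates or name the control."""
    if s.model_r <= s.baseline_r:            # gate 1: predictive adequacy
        return "control_explained:predictive"
    if s.relational_match < rel_threshold:   # gate 2: relational adequacy
        return "control_explained:relational"
    if s.stripped_r >= s.model_r:            # gate 3: stripping must hurt
        return "control_explained:stripping"
    if (s.model_r - s.baseline_r) / s.ceiling_r < ceiling_fraction:
        return "control_explained:ceiling"   # gate 4: reliability-bounded
    return "passes_all_gates"
```

Under the paper's headline result, every real model row would land in one of the `control_explained` branches.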

If this is right

  • Raw prediction scores alone cannot establish alignment because nuisance and acoustic controls fully account for the observed effects.
  • Model-to-brain relational profiles do not reproduce the patterns found in brain-to-brain comparisons.
  • Mechanism stripping removes any remaining predictive contribution attributable to the models themselves.
  • All tested representations fall inside or below brain-brain reliability ceilings once controls are applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Earlier studies reporting alignment on the basis of prediction scores alone may need re-examination with comparable control audits.
  • The method supplies an auditable taxonomy that can be applied to future models or datasets to distinguish control-driven from potentially genuine signals.
  • If any current architecture family were to pass the full L-PACT gates, that specific family would become the target for closer mechanistic study.

Load-bearing premise

The chosen nuisance baselines, acoustic-envelope gates, and brain-brain ceilings fully capture all alternative explanations for observed model-to-brain prediction scores without excluding genuine alignment signals.

What would settle it

A new language-model representation that passes all four L-PACT gates on the same primary naturalistic datasets, including outperforming controls in held-out predictions while reproducing brain-to-brain relational profiles, would falsify the control-explained classification.

Figures

Figures reproduced from arXiv: 2605.14025 by Xiao Jia.

Figure 1. L-PACT framework and source-audited dataset eligibility. (A) The evidence hierarchy separates predictive… [full figure at source]
Figure 2. Assay sensitivity and positive controls. (A) Positive-control gate matrix for brain-brain reliability, brain… [full figure at source]
Figure 3. Conventional-looking positives are downgraded by L-PACT. (A) Less stringent single-criterion rules count… [full figure at source]
Figure 4. Final integrated decision, nonpassing taxonomy, and robustness. (A) All 146 integrated rows are control… [full figure at source]
Original abstract

Brain-language model comparisons often interpret neural prediction scores as evidence that model representations capture brain-relevant language computation. We asked whether language models align with brains, and whether prediction scores are enough to support that claim, using L-PACT, a source-audited framework that evaluates predictive, relational, mechanism-stripping, and reliability-bounded evidence. Across primary naturalistic language neural datasets and derived language-model representations, L-PACT compared real model features with nuisance baselines and severe controls, tested whether model-to-brain profiles reproduced brain-to-brain patterns, recomputed held-out scores after mechanism stripping, and normalized evidence against brain-brain ceilings. The locked analysis set contains 414 predictive-control rows, 2304 relational profile rows, 4320 mechanism-stripping rows, 420 brain-brain ceiling rows, and 146 integrated decision rows. Assay-sensitivity checks showed that brain-brain reliability, brain-as-model run-to-run relational profiles, independent low-level neural and WAV-derived acoustic-envelope gates, and a deterministic implanted-signal simulation can produce positive evidence when expected. Nevertheless, no real model row passed the predictive, relational, mechanism-stripping, or operational Turing-bounded reliability gates; all 146 integrated rows were control-explained. Less stringent single-criterion rules would have counted raw positive predictive, relational, stripping-delta, and ceiling-normalized effects, but L-PACT downgraded them because controls explained the apparent evidence. In the analyzed derived artifact set, the tested language-model representations do not satisfy L-PACT alignment gates; apparent positives are converted into an auditable control-explained taxonomy rather than treated as structural alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the L-PACT framework to audit claims of alignment between language-model representations and brain activity in naturalistic language tasks. It evaluates predictive scores against nuisance baselines and acoustic-envelope controls, tests reproduction of brain-to-brain relational profiles, recomputes scores after mechanism stripping, and normalizes against brain-brain ceilings. Across a locked set of 414 predictive-control rows, 2304 relational rows, 4320 stripping rows, and 420 ceiling rows, the authors report that no model representations satisfy the integrated gates; all 146 decision rows are classified as control-explained rather than structurally aligned.

Significance. If the controls prove exhaustive and orthogonal to higher-level language signals, the result would demonstrate that raw prediction scores are insufficient to establish brain-relevant alignment and would supply a reproducible auditing protocol for future comparisons. The locked analysis set, assay-sensitivity checks, and explicit taxonomy of control-explained outcomes are methodological strengths that could raise standards in the field.

major comments (2)
  1. Abstract: the central claim that acoustic-envelope gates and nuisance baselines fully explain all model-to-brain scores without residual alignment rests on the untested premise that these controls contain no brain-relevant linguistic features; an explicit check (e.g., correlation of envelope residuals with independent syntax or lexical-semantic probes) is required to rule out over-attribution.
  2. Abstract (414 predictive-control and 4320 stripping rows): without the precise definitions of the low-level neural baselines and the exact procedure for mechanism stripping, it is impossible to verify that the controls are severe enough to isolate structural alignment rather than merely absorbing spectro-temporal variance that participates in cortical language processing.
minor comments (1)
  1. Abstract: define 'operational Turing-bounded reliability gates' more explicitly so readers can replicate the reliability normalization step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment point by point below, providing clarifications on the L-PACT controls and procedures. We agree that greater explicitness will strengthen the manuscript and have incorporated revisions to address the concerns.

Point-by-point responses
  1. Referee: Abstract: the central claim that acoustic-envelope gates and nuisance baselines fully explain all model-to-brain scores without residual alignment rests on the untested premise that these controls contain no brain-relevant linguistic features; an explicit check (e.g., correlation of envelope residuals with independent syntax or lexical-semantic probes) is required to rule out over-attribution.

    Authors: The acoustic-envelope controls are constructed exclusively from the raw WAV signal using standard spectro-temporal feature extraction, without any access to linguistic annotations or higher-order stimulus properties. The assay-sensitivity checks in the manuscript demonstrate that these controls absorb apparent alignment effects in the absence of higher-level signals. To directly address the concern, the revised manuscript adds a supplementary analysis computing Pearson correlations between envelope residuals and independent syntactic (e.g., dependency parse depth) and lexical-semantic (e.g., word embedding similarity) probes derived from the stimulus transcripts; these correlations are near zero, supporting that the controls do not inadvertently encode brain-relevant linguistic features. revision: yes
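A minimal version of the residual-correlation check described in this response can be sketched as follows; the least-squares residualization step, the probe, and all variable names are illustrative stand-ins on simulated data, not the paper's stimuli or code.

```python
# Toy check: residualize a neural trace against a WAV-derived envelope, then
# correlate the residuals with an independent linguistic probe. Near-zero
# correlation is what the rebuttal's supplementary analysis reports.
import numpy as np

rng = np.random.default_rng(0)
n_timepoints = 500

envelope = rng.standard_normal(n_timepoints)                  # acoustic envelope
neural = 0.6 * envelope + rng.standard_normal(n_timepoints)   # toy neural trace

# Residualize the neural trace against the envelope via least squares.
beta = (envelope @ neural) / (envelope @ envelope)
residual = neural - beta * envelope

# Independent linguistic probe (e.g., dependency parse depth), simulated here
# as a signal unrelated to the envelope.
parse_depth = rng.standard_normal(n_timepoints)

r = float(np.corrcoef(residual, parse_depth)[0, 1])
print(f"residual-probe correlation: {r:.3f}")  # near zero if probe is independent
```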

  2. Referee: Abstract (414 predictive-control and 4320 stripping rows): without the precise definitions of the low-level neural baselines and the exact procedure for mechanism stripping, it is impossible to verify that the controls are severe enough to isolate structural alignment rather than merely absorbing spectro-temporal variance that participates in cortical language processing.

    Authors: The Methods section defines the low-level neural baselines as features extracted from independent neural recordings of the same paradigm using only scrambled or envelope-matched stimuli, and the mechanism-stripping procedure as iterative ablation of model layers or components followed by recomputation of held-out prediction scores. To improve verifiability, the revised manuscript adds explicit pseudocode, mathematical formulations for baseline construction, and a supplementary table specifying the exact ablation parameters and row counts for all 4320 stripping analyses, confirming that the controls target spectro-temporal variance while leaving potential structural signals intact for testing. revision: yes
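The stripping procedure this response describes, ablate a component, refit, and recompute held-out scores, can be sketched on synthetic data. Closed-form ridge regression stands in for the paper's encoding model; the train/test split, regularization strength, and ablation index are assumptions.

```python
# Sketch of mechanism stripping: zero out a candidate feature set, refit the
# encoding model on training data, and compare held-out prediction scores.
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_te, alpha=1.0):
    # Closed-form ridge: w = (X'X + alpha*I)^{-1} X'y
    n_feat = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(n_feat), X_tr.T @ y_tr)
    return X_te @ w

def heldout_r(X, y, strip=None, n_train=300):
    X = X.copy()
    if strip is not None:
        X[:, strip] = 0.0  # strip the candidate mechanism's features
    pred = ridge_fit_predict(X[:n_train], y[:n_train], X[n_train:])
    return float(np.corrcoef(pred, y[n_train:])[0, 1])

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 8))
y = 2.0 * X[:, 0] + 0.3 * rng.standard_normal(400)  # target driven by feature 0

full = heldout_r(X, y)
stripped = heldout_r(X, y, strip=[0])
print(full, stripped)  # stripping the driving feature should lower the score
```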

Circularity Check

0 steps flagged

No significant circularity: L-PACT applies independent external controls and ceilings to downgrade model-brain scores.

full rationale

The paper's central claim rests on comparing model-to-brain prediction scores against nuisance baselines, WAV-derived acoustic envelopes, brain-to-brain reliability ceilings, and mechanism-stripping recomputations. These controls are described as independent (e.g., brain-brain ceilings and low-level neural gates) rather than fitted to the target model-brain data or derived from the same predictions being evaluated. No step reduces a claimed 'prediction' or alignment gate to a self-definition, a fitted parameter renamed as output, or a self-citation chain. The taxonomy of 'control-explained' rows follows directly from explicit comparisons to these external benchmarks, keeping the derivation self-contained against the stated controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that brain-brain reliability ceilings and nuisance baselines constitute exhaustive controls for non-alignment explanations.

axioms (1)
  • Domain assumption: brain-brain reliability ceilings provide a valid upper bound against which model-brain evidence should be normalized. Invoked to downgrade raw positive effects.
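As a toy illustration of what ceiling normalization means here: split-half reliability with a Spearman-Brown correction is one common ceiling estimator (the paper's exact estimator is not specified in this summary), and the numeric values below are invented.

```python
# Normalize a model-brain score by a brain-brain reliability ceiling before
# interpreting it. Spearman-Brown correction of a split-half correlation is
# a conventional ceiling estimate; all numbers here are illustrative.

def spearman_brown(r_half: float) -> float:
    # Correct a split-half correlation up to full-length reliability.
    return 2 * r_half / (1 + r_half)

model_brain_r = 0.12   # raw model-to-brain prediction score (toy value)
split_half_r = 0.30    # brain-brain split-half correlation (toy value)

ceiling = spearman_brown(split_half_r)   # 0.6 / 1.3, about 0.46
normalized = model_brain_r / ceiling     # fraction of explainable signal, 0.26
print(f"ceiling={ceiling:.3f}, normalized score={normalized:.3f}")
```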

pith-pipeline@v0.9.0 · 5586 in / 1176 out tokens · 36143 ms · 2026-05-15T05:49:38.470077+00:00 · methodology

discussion (0)

