pith. machine review for the scientific record.

arxiv: 2605.08837 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLMs · abstract concepts · grounding · property generation · cognitive science · sparse autoencoders

The pith

LLMs anchor abstract concepts through word associations rather than human-like emotional and experiential grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates cognitive science property-generation experiments on 21 frontier and open-weight LLMs to test how they ground abstract concepts such as justice or theory. Models generate property lists dominated by word associations while producing fewer features tied to emotions and internal states, yielding low similarity to human data. When the same models rate concepts directly on grounding categories, their responses align more closely with humans and improve as model size increases. Internal inspection via sparse autoencoders finds features linked to sensorimotor and social dimensions, indicating that relevant information is encoded but not activated during free generation.
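The comparison at the heart of the replication can be sketched in a few lines: code each generated property into a category, build a per-concept category-frequency vector, and correlate it with the corresponding human vector. The function below is a minimal sketch; the category names and all counts are invented for illustration, not the paper's data or coding scheme.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical property-category frequencies for one abstract concept
# ("justice"): counts of generated properties per coding category.
categories = ["word_association", "emotion", "internal_state",
              "sensorimotor", "social"]
human = [12, 9, 7, 5, 10]   # illustrative human norms, not the paper's data
model = [25, 2, 1, 4, 6]    # illustrative LLM output skewed toward associations

r = pearson_r(human, model)
```

The reported grounding gap is this r staying at or below 0.37 across all 21 models, against a human-to-human ceiling above 0.9.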

Core claim

LLMs recover grounding dimensions when explicitly queried but do not recruit them in a human-like way when words are generated freely for abstract concepts.

What carries the argument

The property-generation task, in which participants list properties associated with abstract concepts to reveal their grounding sources.

Load-bearing premise

The property-generation task and rating scales from cognitive science measure the same underlying grounding process in LLMs as in humans, allowing direct numerical comparison of their outputs.

What would settle it

A replication of the property-generation task in which models receive explicit instructions to include emotional and internal-state properties, followed by a check of whether correlations with human lists rise substantially.
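That check reduces to a before/after comparison of per-concept correlations under the two prompts. A minimal sketch, with invented numbers and a hypothetical threshold for "substantial":

```python
# Hypothetical per-concept Pearson r values against human property lists,
# under the original free-generation prompt vs. a prompt that explicitly
# asks for emotional and internal-state properties (all numbers invented).
baseline_r   = {"justice": 0.31, "theory": 0.28, "availability": 0.35}
instructed_r = {"justice": 0.52, "theory": 0.47, "availability": 0.58}

def mean_rise(before, after):
    """Mean per-concept improvement in correlation with human lists."""
    return sum(after[c] - before[c] for c in before) / len(before)

rise = mean_rise(baseline_r, instructed_r)

# If the rise is large, the gap reflects response style rather than missing
# grounding information; the 0.1 cutoff here is an illustrative choice.
substantial = rise > 0.1
```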

Figures

Figures reproduced from arXiv: 2605.08837 by Christos H. Papadimitriou, Odysseas S. Chlapanis, Orfeas Menis Mastromichalakis.

Figure 1: Per-word property-category frequency distributions for three frontier LLMs vs. the human …
Figure 2: Mean r for all LLMs vs. the human ceiling on Experiment 1 [Harpaintner et al., 2018].
Figure 3: Pearson-r correlation heatmaps for three representative frontier models and humans. Larger and stronger general-purpose systems do not show reliably higher human alignment, suggesting that the mismatch is not simply a capability bottleneck that shrinks with model size; models correlate substantially more with one another than with humans (Figure 3a).
Figure 4: Mean r for all LLMs vs. the human ceiling on the rating experiment [Troche et al., 2017].
Figure 5: Percentage of mean r for all LLMs and SAE features vs. the estimated human ceiling mean.
Figure 6: Property-frequency boxplots for all 21 LLMs, sorted by mean r.
Figure 7: Mean r for all LLMs vs. the human ceiling on Experiment 2 [Kelly et al., 2024].
Figure 8: Experiment 2 properties correlation heatmap for the concrete word subset.
Figure 9: Mean Pearson r vs. number of runs averaged, on Experiment 1, for Gemma-3-4B-IT coded by Gemini-2.5-Flash-Lite. Solid line: mean over 50 random subsets at each subset size; band: ±1 std across subsets; dashed: 100-run asymptote; dotted: the 10-run cadence used by the Experiment 1 leaderboard.
Figure 10: Property-generation prompt for Experiment 1 [Harpaintner et al., 2018], reproduced …
Figure 11: Property-generation prompt used for Experiment 2 [Kelly et al., 2024], reproduced verbatim.
Figure 12: Coding prompt for Experiment 1. Categories are listed verbatim from Harpaintner et al.
Figure 13: Rating prompt for the Troche benchmark.
Figure 14: Number of SAE features per layer identified with our detection algorithm in Gemma-3-4B.
Figure 15: Boxplots for the steered model with sensory (top) and internal (bottom) features.
Original abstract

Abstract concepts - justice, theory, availability - have no single perceivable referent; in the human brain, their meaning emerges from a web of experiences, affect, and social context. Do large language models (LLMs) ground abstract concepts in a similar way? We study this by replicating property-generation experiments from cognitive science on 21 frontier and open-weight LLMs. Across models and experiments, we find a consistent pattern: when compared to humans, models rely too heavily on word associations, and underproduce properties tied to emotion and internal states. This yields a large and consistent grounding gap: no model exceeds a Pearson correlation r=0.37 with human responses, compared to a human-to-human ceiling above r=0.9. To better interpret this gap, we also replicate a rating experiment on grounding categories and find that here LLMs align more closely with human judgment, and alignment improves as models get larger. We then use sparse autoencoders (SAEs) to inspect whether this information is also reflected in the models' internal features, and we do identify features connected to grounding dimensions such as "sensorimotor" and "social". These findings suggest that current LLMs can recover grounding dimensions when explicitly queried, but do not recruit them in a human-like way when words are generated freely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper replicates property-generation experiments from cognitive science on 21 LLMs to compare how they ground abstract concepts (e.g., justice, theory) with humans. It reports consistently low Pearson correlations (no model exceeds r=0.37) versus a human-human ceiling >0.9, with LLMs over-relying on word associations and under-producing emotion/internal-state properties. A rating task shows better alignment that scales with model size, and SAE analysis identifies internal features tied to grounding dimensions such as sensorimotor and social. The authors conclude that LLMs recover grounding information when explicitly queried but do not recruit it in a human-like way during free generation.

Significance. If the patterns are robust, the work offers a large-scale empirical demonstration of systematic differences in abstract-concept representation between LLMs and humans, with implications for interpretability, cognitive modeling, and AI safety. Strengths include the breadth across 21 models, dual behavioral paradigms, and the SAE probe of internal representations. The significance hinges on whether the property-generation outputs provide commensurable measures of grounding.

major comments (3)
  1. [Methods] Methods section: the replication of the property-generation task omits exact prompt wording, response coding scheme (how properties are classified as word-association vs. emotion/internal-state), per-model sample sizes, exclusion criteria, and statistical controls. These omissions prevent full evaluation of the r=0.37 ceiling and the frequency-distribution comparisons that support the grounding-gap claim.
  2. [Results (property-generation experiment)] Results (property-generation experiment): the central claim that LLMs under-recruit grounding dimensions assumes the free-generation task elicits directly comparable conceptual-structure measures in LLMs and humans. Because LLMs operate via next-token statistics, output-style or prompt-interpretation differences could produce the observed gap; a concrete control (e.g., length-matched or format-constrained generation) is needed to isolate grounding from response-regime effects.
  3. [SAE analysis] SAE analysis section: the identification of features linked to 'sensorimotor' and 'social' dimensions should report quantitative activation statistics during the generation versus rating regimes and the precise mapping criteria used to label features; without these, it is unclear whether the internal representations are recruited equivalently in the free-generation setting that drives the main claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'two experiment types' should explicitly name the property-generation and rating tasks to improve clarity for readers unfamiliar with the cognitive-science paradigm.
  2. [Figures] Figures: correlation plots should include human-human baseline lines and confidence intervals to allow immediate visual assessment of the reported gap.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below, indicating the revisions we will make to improve the manuscript's clarity, reproducibility, and rigor.

Point-by-point responses
  1. Referee: [Methods] Methods section: the replication of the property-generation task omits exact prompt wording, response coding scheme (how properties are classified as word-association vs. emotion/internal-state), per-model sample sizes, exclusion criteria, and statistical controls. These omissions prevent full evaluation of the r=0.37 ceiling and the frequency-distribution comparisons that support the grounding-gap claim.

    Authors: We agree that additional methodological detail is necessary for full reproducibility and evaluation. In the revised manuscript, we will include the exact prompt templates used across models, a detailed description of the property coding scheme with explicit criteria and examples for classifying properties (e.g., word-association vs. emotion/internal-state), per-model sample sizes, exclusion criteria applied to responses, and any statistical controls used in the Pearson correlations and frequency analyses. revision: yes

  2. Referee: [Results (property-generation experiment)] Results (property-generation experiment): the central claim that LLMs under-recruit grounding dimensions assumes the free-generation task elicits directly comparable conceptual-structure measures in LLMs and humans. Because LLMs operate via next-token statistics, output-style or prompt-interpretation differences could produce the observed gap; a concrete control (e.g., length-matched or format-constrained generation) is needed to isolate grounding from response-regime effects.

    Authors: We acknowledge the potential for response-regime confounds given LLMs' next-token prediction mechanism. Our design deliberately replicates the free-generation protocol from human cognitive science studies to enable direct comparison, and the substantially higher alignment in the rating task (using identical models but a different output format) indicates the gap is not solely attributable to generation style. We will expand the discussion section to explicitly address this limitation and potential alternative explanations. However, adding new constrained-generation controls would require substantial additional experiments beyond the current scope; we therefore treat this as a point for future work rather than a revision to the present analyses. revision: partial

  3. Referee: [SAE analysis] SAE analysis section: the identification of features linked to 'sensorimotor' and 'social' dimensions should report quantitative activation statistics during the generation versus rating regimes and the precise mapping criteria used to label features; without these, it is unclear whether the internal representations are recruited equivalently in the free-generation setting that drives the main claim.

    Authors: We appreciate this request for greater quantitative detail. In the revised SAE analysis section, we will report activation statistics (including mean activations, distributions, and comparisons) for the identified features across both the generation and rating regimes. We will also provide a precise description of the mapping criteria, including how features were selected and labeled based on top-activating examples, correlations with grounding dimensions, and any thresholding or validation steps used. revision: yes
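The promised regime comparison could take roughly this shape: per-prompt activations of a labeled feature under each regime, summarized by the mean and the fraction of prompts where the feature fires above a threshold. Everything here (the feature label, the numbers, the 0.05 threshold) is invented for illustration, not the paper's analysis.

```python
from statistics import mean

# Invented mean activations of a hypothetical "sensorimotor" SAE feature,
# recorded per prompt under the two behavioral regimes.
activations = {
    "free_generation": [0.02, 0.01, 0.04, 0.00, 0.03],
    "explicit_rating": [0.31, 0.27, 0.40, 0.22, 0.35],
}

# Summarize each regime: mean activation and fraction of prompts where the
# feature fires above an illustrative 0.05 threshold.
summary = {
    regime: {
        "mean": mean(vals),
        "active_fraction": sum(v > 0.05 for v in vals) / len(vals),
    }
    for regime, vals in activations.items()
}

# The claim to check: the feature fires during explicit rating but stays
# largely silent during free generation (5x margin chosen for illustration).
recruited_only_when_queried = (
    summary["explicit_rating"]["mean"] > 5 * summary["free_generation"]["mean"]
)
```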

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements

Full rationale

The paper reports direct experimental replications of property-generation tasks and rating scales from cognitive science, followed by Pearson correlations (r ≤ 0.37 vs. human ceiling > 0.9) and SAE feature inspection. No equations, fitted parameters, or derivations appear; results are computed from model outputs and human data without any step that reduces the claimed gap to a self-definition, renamed input, or self-citation chain. The central findings rest on observable frequency distributions and internal activations rather than any load-bearing assumption that is justified only by prior author work or by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper rests on the validity of prior cognitive-science tasks and standard statistical comparison methods; no free parameters are introduced to define the gap itself.

axioms (2)
  • domain assumption Property-generation responses from humans and LLMs can be directly compared via Pearson correlation as measures of shared grounding.
    Invoked when reporting r=0.37 versus human ceiling >0.9
  • standard math Standard assumptions of Pearson correlation (linearity, normality of residuals) hold for the property-count data.
    Used for all reported correlations
invented entities (1)
  • grounding gap (no independent evidence)
    purpose: Label for the observed systematic difference in property generation between LLMs and humans
    Introduced in the abstract to summarize the main empirical discrepancy
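If the Pearson assumptions flagged above are doubted for skewed property-count data, a rank-based robustness check is cheap. A self-contained Spearman sketch (illustrative data only, not the paper's counts):

```python
def rank(xs):
    """Average ranks (1-based), with tied values sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of ties starting at i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative skewed count data, where Pearson's linearity assumption is shaky.
human = [30, 12, 5, 2, 1]
model = [40, 3, 2, 1, 1]
rho = spearman_rho(human, model)
```

A Spearman rho close to the reported Pearson r would indicate the gap is not an artifact of the linearity assumption.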

pith-pipeline@v0.9.0 · 5552 in / 1417 out tokens · 52658 ms · 2026-05-12T03:12:05.078398+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 1 internal anchor

  1. [1] Borghi, Anna M.; Binkofski, Ferdinand; Castelfranchi, Cristiano; Cimatti, Felice; Scorolli, Claudia; Tummolini, Luca. Psychological Bulletin.
  2. [2] Troche, Joshua; Crutch, Sebastian J.; Reilly, Jamie. Defining a conceptual topography of word concreteness: Clustering properties of emotion, sensation, and magnitude among 750 … 2017.
  3. [3] McRae, Ken; Cree, George S.; Seidenberg, Mark S.; McNorgan, Chris. Behavior Research Methods, 2005.
  4. [4] Xu, Qihui; Peng, Yingying; Nastase, Samuel A.; Chodorow, Martin; Wu, Minghua; Li, Ping. Nature Human Behaviour, 2025.
  5. [5] Perceptual symbol systems. Behavioral and Brain Sciences, 1999.
  6. [6] Situating abstract concepts. In Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thought, 2005.
  7. [7] Lakoff, George; Johnson, Mark. Metaphors We Live By. doi:10.7208/chicago/9780226470993.001.0001.
  8. [8] Harpaintner, Markus; Trumpp, Natalie M.; Kiefer, Markus. The semantic content of abstract concepts: … 2018.
  9. [9] Kelly, Aubrey E.; Kenett, Yoed N.; Medaglia, John D.; Reilly, Jamie J.; Dudhat, Priyanka; Chrysikou, Evangelia G. Emotion.
  10. [10] Brysbaert, Marc; Warriner, Amy Beth; Kuperman, Victor. Concreteness ratings for 40 thousand generally known … 2014.
  11. [11] Lynott, Dermot; Connell, Louise; Brysbaert, Marc; Brand, James; Carney, James. The … 2020.
  12. [12] Scott, Graham G.; Keitel, Anne; Becirspahic, Mia; Yao, Bo; Sereno, Sara C. The … 2019.
  13. [13] Diveica, Veronica; Pexman, Penny M.; Binney, Richard J. Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 … 2023.
  14. [14] A general psychoevolutionary theory of emotion. In Emotion: Theory, Research, and Experience, 1980.
  15. [15] Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. 2024.
  16. [16] Large Language Models Predict Human Sensory Judgments across Six Modalities. Scientific Reports.
  17. [17] Event Knowledge in Large Language Models: The Gap between the Impossible and Unlikely. Cognitive Science.
  18. [18] Conceptual Structure Coheres in Human Cognition but Not in Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
  19. [19] Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation. Transactions of the Association for Computational Linguistics (TACL), 2021.
  20. [20] Cognitive Alignment Between Humans and LLMs Across Multimodal Domains. 2025. doi:10.21203/rs.3.rs-5736241/v1.
  21. [21] Trott, Sean. Can large language models help augment English psycholinguistic datasets? Behavior Research Methods. doi:10.3758/s13428-024-02337-z.
  22. [22] Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color. Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL).
  23. [23] Sensorimotor Regularities as Alignment between Humans and Large Language Models. ACM Transactions on Computer-Human Interaction (TOCHI), 2026.
  24. [24] Human-Machine Cooperation for Semantic Feature Listing. Tiny Papers @ ICLR 2023.
  25. [25] Gurnee, Wes; Tegmark, Max. Language Models Represent Space and Time.
  26. [26] Tigges, Curt; Hollinsworth, Oskar J.; Geiger, Atticus; Nanda, Neel. Language Models Linearly Represent Sentiment. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2024. doi:10.18653/v1/2024.blackboxnlp-1.5.
  27. [27] LEACE: Perfect Linear Concept Erasure in Closed Form. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2306.03819.
  28. [28] Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. The Eleventh International Conference on Learning Representations.
  29. [29] Nanda, Neel; Lee, Andrew; Wattenberg, Martin. Emergent Linear Representations in World Models of Self-Supervised Sequence Models. Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. doi:10.18653/v1/2023.blackboxnlp-1.2.
  30. [30] Gemma Scope 2 - Technical Paper. 2025.
  31. [31] AI Shares Emotion with Humans across Languages and Cultures. 2025.
  32. [32] Wang, Chenxi; Zhang, Yixuan; Yu, Ruiji; Zheng, Yufei; Gao, Lang; Song, Zirui; Xu, Zixiang; Xia, Gus; Zhang, Huishuai; Zhao, Dongyan; Chen, Xiuying. Do … arXiv:2510.11328.
  33. [33] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. Advances in Neural Information Processing Systems (NeurIPS), oral. arXiv:2409.14507.
  34. [34] Critchley, Hugo D.; Wiens, Stefan; Rotshtein, Pia; Öhman, Arne; Dolan, Raymond J. Nature Neuroscience.
  35. [35] Ettinger, Allyson. What … 2020.
  36. [36] Tenney, Ian; Das, Dipanjan; Pavlick, Ellie. 2019.
  37. [37] Sparse Autoencoders Find Highly Interpretable Features in Language Models. The Twelfth International Conference on Learning Representations.
  38. [38] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023.
  39. [39] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.
  40. [40] Zhang, Junyu; Kang, Yipeng; Guo, Jiong; Zhan, Jiayu; Wang, Junqi. arXiv:2601.14007.
  41. [41] Emergent Abilities of Large Language Models. Transactions on Machine Learning Research, 2022.
  42. [42] Lieberum, Tom; Rajamanoharan, Senthooran; Conmy, Arthur; Smith, Lewis; Sonnerat, Nicolas; Varma, Vikrant; Kramar, Janos; Dragan, Anca; Shah, Rohin; Nanda, Neel. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP.
  43. [43] Zou, Andy; Phan, Long; Chen, Sarah; Campbell, James; Guo, Phillip; Ren, Richard; Pan, Alexander; Yin, Xuwang; Mazeika, Mantas; Dombrowski, Ann-Kathrin; Goel, Shashwat; Li, Long; Byun, Michael J.; Wang, Zifan; Mallen, Alex; Hendrycks, Dan. Representation engineering: A top-down approach to … 2024.
  44. [44] Panickssery, Nina; Gabrieli, Nick; Schulz, Julian; Tong, Meg; Hubinger, Evan; Turner, Alex. Steering … 2024.
  45. [45] Steering Language Models With Activation Engineering. 2024.
  46. [46] Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 1989.
  47. [47] Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
  48. [48] A general language assistant as a laboratory for alignment. 2021.
  49. [49] Varieties of abstract concepts and their multiple dimensions. Language and Cognition, 2019.
  50. [50] On a confusion about a function of consciousness. Behavioral and Brain Sciences, 1995.
  51. [51] Words as social tools: Language, sociality and inner grounding in abstract concepts. Physics of Life Reviews, 2018.
  52. [52] Louapre, David. 2025.
  53. [53] Farrell, Eoin; Lau, Yeu-Tong; Conmy, Arthur. NeurIPS 2024 Workshop on Safe Generative AI.
  54. [54] Khoriaty, Matthew; Shportko, Andrii; Mercier, Gustavo; Wood-Doughty, Zach. 2025.
  55. [55] Ameisen, Emmanuel; Lindsey, Jack; Pearce, Adam; Gurnee, Wes; Turner, Nicholas L.; Chen, Brian; Citro, Craig; et al. Transformer Circuits Thread.
  56. [56] Elhage, Nelson; Nanda, Neel; Olsson, Catherine; Henighan, Tom; Joseph, Nicholas; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; DasSarma, Nova; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, … Transformer Circuits Thread.
  57. [57] Barsalou, Lawrence W. Open Encyclopedia of Cognitive Science, 2026.
  58. [58] Toy Models of Superposition. 2022.
  59. [59] Niklaus, Joel; et al. 2026.
  60. [60] Landis, J. Richard; Koch, Gary G. Biometrics, 1977.
  61. [61] Li, Xuechen; Zhang, Tianyi; Dubois, Yann; Taori, Rohan; Gulrajani, Ishaan; Guestrin, Carlos; Liang, Percy; Hashimoto, Tatsunori B. 2023.
  62. [62] Lin, Johnny; Bloom, Joseph. 2024.
  63. [63] Lindsey, Jack. Transformer Circuits Thread.
  64. [64] Pepper, Keenan; McKenzie, Alex; Pop, Florin; Servaes, Stijn; Leitgab, Martin; Vaiana, Mike; Rosenblatt, Judd; Graziano, Michael S. A.; de Lucena, Diogo. arXiv preprint arXiv:2602.10352.
  65. [65] Bender, Emily M.; Koller, Alexander. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.463.
  66. [66] Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021. doi:10.1145/3442188.3445922.
  67. [67] Gemini 3. 2026.
  68. [68] Somatotopic representation of action words in human motor and premotor cortex. Neuron, 2004.
  69. [69] Brain mechanisms linking language and action. Nature Reviews Neuroscience, 2005.
  70. [70] Visual semantic features are activated during the processing of concrete words: event-related potential evidence for perceptual semantic priming. Cognitive Brain Research, 2000.
  71. [71] Emotion Concepts and their Function in a Large Language Model. 2026.
  72. [72] Emotional and Social Dimension of Abstract Concepts Meet with Interoception in Right Anterior Insula. Journal of Neuroscience, 2026.
  73. [73] Grounding abstract concepts and beliefs into experience: The embodied perspective. Frontiers in Psychology, 2022.
  74. [74] Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity. arXiv preprint arXiv:2604.24827.
  75. [75] Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; de Las Casas, Diego; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; van den Driessche, George; Damoc, Bogdan; Guy, Aurelia; Osindero, Simon; …
  76. [76] Manheim, David; Garrabrant, Scott. arXiv.
  77. [77] Warstadt, Alex; Mueller, Aaron; Choshen, Leshem; Wilcox, Ethan; Zhuang, Chengxu; Ciro, Juan; Mosquera, Rafael; Paranjabe, Bhargavi; Williams, Adina; Linzen, Tal; Cotterell, Ryan. Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora. Proceedings of the BabyLM Challenge at the 27th …
  78. [78] Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; Zhuang, Siyuan; Wu, Zhanghao; Zhuang, Yonghao; Lin, Zi; Li, Zhuohan; Li, Dacheng; Xing, Eric; Zhang, Hao; Gonzalez, Joseph; Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
  79. [79] GPT-4 Technical Report. 2023.
  80. [80] Fan, Yu; Ni, Jingwei; Merane, Jakob; Tian, Yang; Hermstr… The Fourteenth International Conference on Learning Representations.

Showing first 80 references.