Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives
Pith reviewed 2026-05-07 16:20 UTC · model grok-4.3
The pith
Large language models fail to grasp proximal-distal contrasts in demonstratives and show no cultural differences
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using demonstratives as a novel probe for grounded knowledge, the study establishes that English speakers distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently yet tolerate distal ambiguity; five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning.
What carries the argument
Demonstratives as a probe for embodied cognition and cultural conventions in spatial reference and perspective-taking
If this is right
- Introduces a new task based on demonstratives for evaluating embodied cognition and cultural conventions.
- Supplies empirical evidence of cross-cultural asymmetries in human interpretation of spatial expressions.
- Shows that egocentric and sociocentric orientations coexist but vary in strength across languages.
- Calls for future model design to address individual variation within and across languages.
Where Pith is reading between the lines
- Models may need explicit multimodal or embodied training signals to develop the spatial perspective-taking abilities that text alone does not produce.
- The same demonstrative probe could be applied to other linguistic categories to map additional gaps in cultural grounding.
- Individual differences within each language community could be modeled separately rather than averaged into a single cultural template.
Load-bearing premise
That observed differences in human demonstrative judgments directly reflect embodied cognition and cultural conventions rather than task-specific factors such as education, response style, or prompt interpretation.
What would settle it
A replication in which the same LLMs, after additional training on culturally diverse spatial-language data, begin to reproduce the distinct proximal-distal and perspective-switching patterns shown by native English and Chinese speakers.
read the original abstract
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces demonstratives as a probe for embodied cognition and cultural conventions in LLMs. It reports a human baseline from 6,400 responses by 320 native speakers showing English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to show the proximal-distal contrast or cultural differences, defaulting to English-centric reasoning. The work contributes a new evaluation task, cross-linguistic human data, and implications for the egocentric-sociocentric debate.
Significance. If the negative LLM result holds after addressing methodological controls, the study offers a falsifiable, cross-linguistic test of whether text-only training suffices for grounded spatial and cultural knowledge. The large human sample provides a clear baseline, and the task design directly targets a linguistic phenomenon with documented embodied and cultural dimensions. This could inform future model evaluation beyond English-centric benchmarks.
major comments (1)
- [LLM evaluation / prompting procedure] The central claim that LLMs 'default to English-centric reasoning' and fail to capture Chinese conventions requires explicit confirmation that prompts for Chinese demonstratives (zhè/nà) were presented in Chinese or used language-matched templates. If the LLM evaluation used English-only prompts or English-centric framing even for the Chinese condition (as implied by the abstract's contrast), the observed default follows from training-data distribution and does not isolate absence of cultural knowledge acquired from text. This assumption is load-bearing for the claim that the results demonstrate failure to capture embodied/cultural conventions rather than prompt-language effects. Please add the exact prompt templates, language of presentation, and any controls for prompt language in the LLM methods section.
minor comments (2)
- [Results / statistical analysis] Clarify the exact statistical tests and controls used for the human-LLM comparison (e.g., how individual variation and response-style differences were handled).
- [Abstract and §5] The abstract states 'no cultural differences' for LLMs; specify whether this is a null result with power analysis or simply absence of the human-like pattern.
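On the second minor point, the cleanest way to back a "no cultural differences" claim is an equivalence test rather than a failed significance test. Below is a minimal, stdlib-only sketch of a two-one-sided-tests (TOST) procedure for two proportions; the function, the 10-point margin, and the example counts are illustrative assumptions, not figures from the paper:

```python
import math

def _norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_two_proportions(x1: int, n1: int, x2: int, n2: int,
                         margin: float = 0.10) -> float:
    """Two one-sided z-tests: is |p1 - p2| inside the equivalence margin?

    Returns the larger of the two one-sided p-values; a small value
    supports 'no meaningful difference' at the chosen margin, which is
    a stronger statement than a non-significant difference test.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    p_lower = 1.0 - _norm_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = _norm_cdf((diff - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical proximal-choice counts for an English vs. Chinese condition:
p = tost_two_proportions(480, 640, 470, 640, margin=0.10)
print(p)  # well below 0.05 here, supporting equivalence at this margin
```

Reporting such a p-value alongside the chosen margin would let readers distinguish "models show no cultural difference" from "the study could not detect one."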
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address the major comment point by point below and will incorporate the requested clarifications into a revised version.
read point-by-point responses
Referee: The central claim that LLMs 'default to English-centric reasoning' and fail to capture Chinese conventions requires explicit confirmation that prompts for Chinese demonstratives (zhè/nà) were presented in Chinese or used language-matched templates. If the LLM evaluation used English-only prompts or English-centric framing even for the Chinese condition (as implied by the abstract's contrast), the observed default follows from training-data distribution and does not isolate absence of cultural knowledge acquired from text. This assumption is load-bearing for the claim that the results demonstrate failure to capture embodied/cultural conventions rather than prompt-language effects. Please add the exact prompt templates, language of presentation, and any controls for prompt language in the LLM methods section.
Authors: We appreciate the referee's emphasis on methodological transparency for isolating cultural knowledge from prompt-language effects. In the original experiments, prompts for the Chinese condition used language-matched templates: instructions and response options were written in Chinese, incorporating the demonstratives 'zhè' (proximal) and 'nà' (distal) directly in the prompt text, while the English condition used parallel English templates with 'this' and 'that'. This design was intended to minimize English-centric framing. However, the current manuscript does not include the verbatim prompt templates or an explicit statement confirming the presentation language for each condition. We will revise the LLM evaluation subsection of the Methods to add the full English and Chinese prompt templates, describe the language of presentation, and note any additional controls (e.g., zero-shot consistency checks across languages). This addition will directly address the concern and strengthen the interpretation that the observed English-centric defaults reflect limitations in acquired knowledge rather than prompt artifacts.
revision: yes
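The language-matched design the authors describe can be sketched as follows. The template wording, field names, and distances are hypothetical placeholders for illustration, not the paper's actual prompts (which the referee asks to see verbatim):

```python
# Hypothetical language-matched prompt templates for the demonstrative task.
# Each condition stays monolingual, so no English framing leaks into the
# Chinese condition; the wording below is illustrative only.
TEMPLATES = {
    "en": (
        "An object lies {distance} from the speaker. Which demonstrative "
        "would the speaker use to refer to it? Answer 'this' or 'that'."
    ),
    "zh": (
        "一个物体位于距说话者{distance}处。"
        "说话者会用哪个指示词指称它？请回答“这”或“那”。"
    ),
}

def build_prompt(lang: str, distance: str) -> str:
    """Fill the condition-matched template for one test item."""
    return TEMPLATES[lang].format(distance=distance)

print(build_prompt("zh", "三米"))  # a fully Chinese prompt, no English framing
```

A per-item consistency check comparing model answers across the two template sets would then directly control for prompt-language effects.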
Circularity Check
No circularity in empirical comparison of human and LLM responses
full rationale
The paper presents a purely empirical study: it collects 6,400 human judgments from native speakers on demonstrative tasks in English and Chinese, then evaluates five LLMs on the same items and reports descriptive differences. No equations, parameters, or derivations are defined; there are no fitted quantities renamed as predictions, no self-definitional constructs, and no load-bearing self-citations that reduce the central claim to prior author work by construction. The analysis rests on direct data collection and comparison, which is self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Demonstratives encode embodied spatial cognition and cultural conventions that can be probed via controlled reference tasks.