Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives
Pith reviewed 2026-05-07 16:20 UTC · model grok-4.3
The pith
Large language models fail to grasp proximal-distal contrasts in demonstratives and show no cultural differences
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using demonstratives as a novel probe for grounded knowledge, the study establishes that English speakers distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently yet tolerate distal ambiguity; five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning.
What carries the argument
Demonstratives as a probe for embodied cognition and cultural conventions in spatial reference and perspective-taking
If this is right
- Introduces a new task based on demonstratives for evaluating embodied cognition and cultural conventions.
- Supplies empirical evidence of cross-cultural asymmetries in human interpretation of spatial expressions.
- Shows that egocentric and sociocentric orientations coexist but vary in strength across languages.
- Calls for future model design to address individual variation within and across languages.
Where Pith is reading between the lines
- Models may need explicit multimodal or embodied training signals to develop the spatial perspective-taking abilities that text alone does not produce.
- The same demonstrative probe could be applied to other linguistic categories to map additional gaps in cultural grounding.
- Individual differences within each language community could be modeled separately rather than averaged into a single cultural template.
Load-bearing premise
That observed differences in human demonstrative judgments directly reflect embodied cognition and cultural conventions rather than task-specific factors such as education, response style, or prompt interpretation.
What would settle it
A replication in which the same LLMs, after additional training on culturally diverse spatial-language data, begin to reproduce the distinct proximal-distal and perspective-switching patterns shown by native English and Chinese speakers.
read the original abstract
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces demonstratives as a probe for embodied cognition and cultural conventions in LLMs. It reports a human baseline from 6,400 responses by 320 native speakers showing English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to show the proximal-distal contrast or cultural differences, defaulting to English-centric reasoning. The work contributes a new evaluation task, cross-linguistic human data, and implications for the egocentric-sociocentric debate.
Significance. If the negative LLM result holds after addressing methodological controls, the study offers a falsifiable, cross-linguistic test of whether text-only training suffices for grounded spatial and cultural knowledge. The large human sample provides a clear baseline, and the task design directly targets a linguistic phenomenon with documented embodied and cultural dimensions. This could inform future model evaluation beyond English-centric benchmarks.
major comments (1)
- [LLM evaluation / prompting procedure] The central claim that LLMs 'default to English-centric reasoning' and fail to capture Chinese conventions requires explicit confirmation that prompts for Chinese demonstratives (zhè/nà) were presented in Chinese or used language-matched templates. If the LLM evaluation used English-only prompts or English-centric framing even for the Chinese condition (as implied by the abstract's contrast), the observed default follows from training-data distribution and does not isolate absence of cultural knowledge acquired from text. This assumption is load-bearing for the claim that the results demonstrate failure to capture embodied/cultural conventions rather than prompt-language effects. Please add the exact prompt templates, language of presentation, and any controls for prompt language in the LLM methods section.
minor comments (2)
- [Results / statistical analysis] Clarify the exact statistical tests and controls used for the human-LLM comparison (e.g., how individual variation and response-style differences were handled).
- [Abstract and §5] The abstract states 'no cultural differences' for LLMs; specify whether this is a null result with power analysis or simply absence of the human-like pattern.
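On the second minor point, the cleanest way to back a "no cultural differences" claim is an equivalence test rather than a failed significance test. Below is a minimal, stdlib-only sketch of a two-one-sided-tests (TOST) procedure for two proportions; the function, the 10-point margin, and the example counts are illustrative assumptions, not figures from the paper:

```python
import math

def _norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_two_proportions(x1: int, n1: int, x2: int, n2: int,
                         margin: float = 0.10) -> float:
    """Two one-sided z-tests: is |p1 - p2| inside the equivalence margin?

    Returns the larger of the two one-sided p-values; a small value
    supports 'no meaningful difference' at the chosen margin, which is
    a stronger statement than a non-significant difference test.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    p_lower = 1.0 - _norm_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = _norm_cdf((diff - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical proximal-choice counts for an English vs. Chinese condition:
p = tost_two_proportions(480, 640, 470, 640, margin=0.10)
print(p)  # well below 0.05 here, supporting equivalence at this margin
```

Reporting such a p-value alongside the chosen margin would let readers distinguish "models show no cultural difference" from "the study could not detect one."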
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address the major comment point by point below and will incorporate the requested clarifications into a revised version.
read point-by-point responses
Referee: The central claim that LLMs 'default to English-centric reasoning' and fail to capture Chinese conventions requires explicit confirmation that prompts for Chinese demonstratives (zhè/nà) were presented in Chinese or used language-matched templates. If the LLM evaluation used English-only prompts or English-centric framing even for the Chinese condition (as implied by the abstract's contrast), the observed default follows from training-data distribution and does not isolate absence of cultural knowledge acquired from text. This assumption is load-bearing for the claim that the results demonstrate failure to capture embodied/cultural conventions rather than prompt-language effects. Please add the exact prompt templates, language of presentation, and any controls for prompt language in the LLM methods section.
Authors: We appreciate the referee's emphasis on methodological transparency for isolating cultural knowledge from prompt-language effects. In the original experiments, prompts for the Chinese condition used language-matched templates: instructions and response options were written in Chinese, incorporating the demonstratives 'zhè' (proximal) and 'nà' (distal) directly in the prompt text, while the English condition used parallel English templates with 'this' and 'that'. This design was intended to minimize English-centric framing. However, the current manuscript does not include the verbatim prompt templates or an explicit statement confirming the presentation language for each condition. We will revise the LLM evaluation subsection of the Methods to add the full English and Chinese prompt templates, describe the language of presentation, and note any additional controls (e.g., zero-shot consistency checks across languages). This addition will directly address the concern and strengthen the interpretation that the observed English-centric defaults reflect limitations in acquired knowledge rather than prompt artifacts.
revision: yes
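The language-matched design the authors describe can be sketched as follows. The template wording, field names, and distances are hypothetical placeholders for illustration, not the paper's actual prompts (which the referee asks to see verbatim):

```python
# Hypothetical language-matched prompt templates for the demonstrative task.
# Each condition stays monolingual, so no English framing leaks into the
# Chinese condition; the wording below is illustrative only.
TEMPLATES = {
    "en": (
        "An object lies {distance} from the speaker. Which demonstrative "
        "would the speaker use to refer to it? Answer 'this' or 'that'."
    ),
    "zh": (
        "一个物体位于距说话者{distance}处。"
        "说话者会用哪个指示词指称它？请回答“这”或“那”。"
    ),
}

def build_prompt(lang: str, distance: str) -> str:
    """Fill the condition-matched template for one test item."""
    return TEMPLATES[lang].format(distance=distance)

print(build_prompt("zh", "三米"))  # a fully Chinese prompt, no English framing
```

A per-item consistency check comparing model answers across the two template sets would then directly control for prompt-language effects.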
Circularity Check
No circularity in empirical comparison of human and LLM responses
full rationale
The paper presents a purely empirical study: it collects 6,400 human judgments from native speakers on demonstrative tasks in English and Chinese, then evaluates five LLMs on the same items and reports descriptive differences. No equations, parameters, or derivations are defined; there are no fitted quantities renamed as predictions, no self-definitional constructs, and no load-bearing self-citations that reduce the central claim to prior author work by construction. The analysis rests on direct data collection and comparison, which is self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Demonstratives encode embodied spatial cognition and cultural conventions that can be probed via controlled reference tasks.