Using Perspectival Words Is Harder Than Vocabulary Words for Humans and Even More So for Multimodal Language Models
Pith reviewed 2026-05-19 13:01 UTC · model grok-4.3
The pith
Multimodal models perform nearly as well as humans on vocabulary words but show larger deficits with possessives and even more with demonstratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Testing seven multimodal language models against human participants reveals that while models approach human-level performance on vocabulary words, they exhibit clear deficits with possessives and even greater difficulty with demonstratives. Ablation analyses point to limitations in perspective-taking and spatial reasoning as key sources of these performance gaps, and instruction-based prompting reduces the gap for possessives but leaves demonstratives far below human levels.
What carries the argument
Performance comparison across vocabulary, possessive, and demonstrative word types in humans and multimodal models, with ablation studies targeting perspective-taking and spatial reasoning abilities.
If this is right
- Perspective-taking limitations in models contribute to poorer handling of everyday conversational references.
- Instruction prompting can improve use of possessives but has limited effect on demonstratives.
- Multimodal models have shortfalls in pragmatic and social-cognitive abilities compared to humans.
- These word types impose increasing cognitive demands that highlight differences in human and model communication skills.
Where Pith is reading between the lines
- Improving spatial reasoning in models could enhance their ability to refer to objects in shared environments.
- Similar perspective challenges may appear in other language tasks involving context-dependent references.
- Future evaluations of language models should include more tests of deictic and possessive language use.
- Training with interactive scenarios involving multiple viewpoints might close some of these gaps.
Load-bearing premise
The selected tasks for possessives and demonstratives isolate perspective-taking and spatial reasoning without interference from model training data or other design factors.
What would settle it
A new experiment where models are given explicit perspective or spatial cues in the input and show no remaining performance gap on demonstratives would falsify the claim that these cognitive limitations are the primary cause.
Original abstract
Multimodal language models (MLMs) increasingly demonstrate human-like communication, yet their use of everyday perspectival words remains poorly understood. To address this gap, we compare humans and MLMs in their use of three word types that impose increasing cognitive demands: vocabulary (for example, "boat" or "cup"), possessives (for example, "mine" versus "yours"), and demonstratives (for example, "this one" versus "that one"). Testing seven MLMs against human participants, we find that perspectival words are harder than vocabulary words for both groups. The gap is larger for MLMs: while models approach human-level performance on vocabulary, they show clear deficits with possessives and even greater difficulty with demonstratives. Ablation analyses indicate that limitations in perspective-taking and spatial reasoning are key sources of these gaps. Instruction-based prompting reduces the gap for possessives but leaves demonstratives far below human performance. These results show that, unlike vocabulary, perspectival words pose a greater challenge in human communication, and this difficulty is amplified in MLMs, revealing a shortfall in their pragmatic and social-cognitive abilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares human participants and seven multimodal language models on tasks using vocabulary words versus perspectival words (possessives like 'mine' vs 'yours' and demonstratives like 'this' vs 'that'). It reports that perspectival words are harder than vocabulary words for both groups, with a substantially larger gap for MLMs; models approach human performance on vocabulary but show clear deficits on possessives and even greater difficulty on demonstratives. Ablation analyses are presented as evidence that limitations in perspective-taking and spatial reasoning are the primary sources of the MLM gaps, while instruction-based prompting narrows the gap for possessives but leaves demonstratives far below human levels.
Significance. If the central empirical patterns and ablation attributions hold after addressing controls, the work would usefully document a specific pragmatic shortfall in current MLMs relative to humans and provide a benchmark for future model development in social-cognitive language use. The human-MLM comparison and ablation approach constitute a structured empirical contribution in the cs.CL literature on multimodal pragmatics.
major comments (1)
- Ablation analyses (abstract and corresponding results section): The claim that perspective-taking and spatial-reasoning limitations are the key sources of the larger MLM gaps rests on ablation analyses. These analyses are only as strong as their ability to hold constant all other factors (pretraining distribution of spatial language, visual grounding mechanisms, instruction-following robustness). If an ablation that removes perspective cues simultaneously degrades general visual-linguistic alignment or increases prompt brittleness, the performance drop cannot be attributed specifically to perspective-taking. The abstract presents this attribution without reporting controls that would separate these possibilities.
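One way to make this concern concrete is a difference-in-differences contrast: compare the accuracy drop an ablation causes on the perspectival condition with the drop it causes on a vocabulary control. The sketch below is illustrative only. The function name, the condition labels, and the accuracies are hypothetical and are not taken from the paper.

```python
def ablation_contrast(acc: dict[tuple[str, str], float]) -> float:
    """Accuracy drop on the perspectival condition minus the drop on the control.

    Keys are (word_type, setting) pairs; all names here are hypothetical.
    A contrast near zero means the ablation hurt both conditions about equally,
    so the drop cannot be attributed specifically to perspective-taking.
    """
    target_drop = acc[("perspectival", "full")] - acc[("perspectival", "ablated")]
    control_drop = acc[("vocabulary", "full")] - acc[("vocabulary", "ablated")]
    return target_drop - control_drop


# Hypothetical accuracies, for illustration only (not results from the paper).
hypothetical_acc = {
    ("perspectival", "full"): 0.80,
    ("perspectival", "ablated"): 0.50,
    ("vocabulary", "full"): 0.95,
    ("vocabulary", "ablated"): 0.90,
}

print(f"contrast = {ablation_contrast(hypothetical_acc):.2f}")
```

A large positive contrast would be consistent with a perspective-specific effect, though it would still not rule out prompt brittleness, which needs its own check (for example, paraphrased prompts).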
minor comments (2)
- Methods section: The manuscript does not report the number of human participants, the exact stimuli and trial counts, or the statistical methods used to compare performances and interpret ablation effects; these details are needed to evaluate the strength of the reported patterns.
- Results presentation: Tables or figures summarizing per-model and per-condition accuracies, together with confidence intervals or p-values, would improve clarity and allow readers to assess the magnitude of the reported gaps.
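As a rough illustration of the kind of summary requested here, the sketch below computes a Wilson score interval for each per-model, per-condition accuracy. It is a minimal example under stated assumptions: the model name, condition labels, and counts are hypothetical placeholders, and the paper may use a different statistical method.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical (correct, total) counts, for illustration only.
hypothetical_counts = {
    ("model_a", "vocabulary"): (95, 100),
    ("model_a", "possessives"): (75, 100),
    ("model_a", "demonstratives"): (55, 100),
}

for (model, condition), (correct, total) in hypothetical_counts.items():
    low, high = wilson_interval(correct, total)
    print(f"{model} {condition}: acc={correct / total:.2f}, CI=[{low:.2f}, {high:.2f}]")
```

The Wilson interval is used here rather than the simpler normal approximation because it stays within [0, 1] and behaves better when accuracy is near ceiling, as vocabulary accuracy is reported to be.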
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The concern about the ablation analyses is well-taken, and we address it directly below. We maintain that the empirical patterns are robust but agree that the manuscript would benefit from greater caution in attributing the gaps and from additional discussion of controls.
Point-by-point responses
- Referee: Ablation analyses (abstract and corresponding results section): The claim that perspective-taking and spatial-reasoning limitations are the key sources of the larger MLM gaps rests on ablation analyses. These analyses are only as strong as their ability to hold constant all other factors (pretraining distribution of spatial language, visual grounding mechanisms, instruction-following robustness). If an ablation that removes perspective cues simultaneously degrades general visual-linguistic alignment or increases prompt brittleness, the performance drop cannot be attributed specifically to perspective-taking. The abstract presents this attribution without reporting controls that would separate these possibilities.
Authors: We thank the referee for this important methodological point. Our ablation procedure modified only the perspectival elements (e.g., replacing 'this' with 'that' or 'mine' with 'yours') while holding the visual scene, object labels, and core task instructions fixed across conditions. We observed that vocabulary-word performance remained high and stable under these changes, which provides some evidence that general visual-linguistic alignment and instruction following were not broadly disrupted. Nevertheless, we did not include separate probes for non-perspectival spatial reasoning or explicit measures of prompt sensitivity, so we cannot fully rule out the confounds raised. In the revision we will (1) add a new subsection that explicitly discusses these alternative explanations and reports any available stability checks, (2) replace the abstract phrasing 'indicate that limitations ... are key sources' with the more qualified 'suggest that limitations in perspective-taking and spatial reasoning contribute substantially to these gaps', and (3) note the absence of exhaustive controls as a limitation. We view this as a partial revision: the core comparative results and the direction of the ablation effects stand, but we accept that stronger isolation of the target constructs would strengthen the causal claim.
Revision: partial
Circularity Check
No significant circularity: empirical study grounded in external human benchmarks
Full rationale
This is a standard empirical comparison study that tests seven MLMs against human participants on vocabulary, possessive, and demonstrative tasks, then uses ablation analyses to probe sources of observed gaps. All central claims rest on direct experimental measurements with human performance serving as an independent external benchmark, rather than on any derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear; the paper does not rename known results or reduce its findings to its own inputs by construction. The work is therefore self-contained against external benchmarks and receives the default non-circularity outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected word types (vocabulary, possessives, demonstratives) impose increasing cognitive demands on perspective-taking and spatial reasoning
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  "Ablation analyses indicate that limitations in perspective-taking and spatial reasoning are key sources of these gaps."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.