pith. sign in

arxiv: 2505.21649 · v7 · submitted 2025-05-27 · 💻 cs.CV

Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs

Pith reviewed 2026-05-19 12:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords object orientationmulti-modal large language modelsvision-language modelsbenchmarkspatial reasoning3D perceptionorientation understanding
0
0 comments X

The pith

Multi-modal large language models fail to understand object orientations, scoring at most 54.2% on coarse tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DORI, a benchmark that tests how multi-modal models grasp object orientations apart from position or general scene skills. It uses tasks from 11 datasets spanning 67 categories in synthetic and real images to check frontal alignment, rotational transformations, relative directions, and standard views. Evaluations of 15 leading models show top scores of only 54.2 percent on broad orientation questions and 33 percent on finer ones. Accuracy drops more when tasks involve shifting reference frames or handling several rotations together. The findings point to weak internal 3D spatial models in these systems, which limits reliable use in robotics and augmented reality.

Core claim

DORI establishes that state-of-the-art vision-language models exhibit systematic limitations in orientation comprehension, achieving at most 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations, indicating deficiencies in their internal 3D spatial representations.

What carries the argument

DORI benchmark, which assesses four dimensions of orientation comprehension through tasks from 11 datasets across 67 categories in synthetic and real scenarios.

If this is right

  • Robotic manipulation would gain from models that correctly track object rotations and alignments.
  • Augmented reality interactions could place and manipulate objects more accurately with better orientation awareness.
  • 3D scene reconstruction would improve if models could follow orientation changes across different viewpoints.
  • Human-AI systems operating in physical spaces would function more reliably with correct orientation judgments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training models explicitly on rotation and reference-frame examples might reduce the observed errors.
  • The same benchmark approach could help diagnose other spatial reasoning gaps beyond orientation.
  • Fixing these issues may require new internal representations that treat 3D rotations as a core capability.

Load-bearing premise

The curated tasks drawn from 11 datasets across 67 categories successfully isolate pure orientation comprehension without conflating it with positional relationships or general scene understanding.

What would settle it

A model achieving substantially higher accuracy, such as above 80 percent, on the granular orientation judgment tasks in DORI would indicate that the reported systematic failures are not present.

Figures

Figures reproduced from arXiv: 2505.21649 by Bryan A. Plummer, Deepti Ghadiyaram, Elva Zou, Keanu Nichols, Nazia Tasnim, Nicholas Ikechukwu, Yuting Yan.

Figure 1
Figure 1. Figure 1: DORI captures four core dimensions of orientation reasoning intelligence: (1) object’s directional alignment, (2) its orientation relative to viewers, scenes, and other objects, (3) required rotational transformation for different objectives, and (4) its natural/canonical orientation in the world. Each dimension evaluates specific perceptual abilities through visual tasks in varying settings. DORI provides… view at source ↗
Figure 2
Figure 2. Figure 2: Structured prompt design and example question–answer pairs from DORI. Each query follows a consistent format comprising a task description, contextual definition, step-by-step instructions, multiple-choice options, and illustrative examples, enabling systematic evaluation of orientation perception. All questions include the option “Can￾not be determined,” maintaining a uniform answer space while explicitly… view at source ↗
Figure 3
Figure 3. Figure 3: Representative samples from each dataset, including natural (Kitti, Cityscapes, Coco, SSFRB, etc.) and simulated (JTA, 3D-Future, Get-3D, etc.) sources. expose not just whether models fail, but where and why: how closely do they match human-level orientation performance, do their reasoning patterns reflect those observed in human visuospatial cognition [8,88,99], and what architectural or training factors … view at source ↗
Figure 4
Figure 4. Figure 4: (a) DORI comprises seven orientation tasks with a balanced distribution across natural and simulated images. (b) It features diverse everyday objects to comprehensively evaluate orientation understanding. oriented relative to the camera position (e.g., whether a desk is facing toward, away, leftward, or rightward from the viewer’s perspective). This dual assessment is supported by research that reports vie… view at source ↗
Figure 5
Figure 5. Figure 5: Performance of MLLMs by source category (additional models in supple￾mentary). Although for many categories the relative ranking of methods is rela￾tively stable, in a few cases, like the food category, most models perform poorly. 3D_Future Ssfrb Cityscapes Coco Get_3D Jta Kitti Nocs_Real Objectron Omniobject3D Shapenet 0 20 40 60 Model Gemini_1.5_Pro Gemini_2.0_Flash Gpt_4_1 Llava_34B Mantis_8B_Clip [PIT… view at source ↗
Figure 7
Figure 7. Figure 7: Sample category distribution of top 25 object categories showing their sources categories but also on long-tail classes, thereby encouraging more balanced and comprehensive assessment [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Sample category distribution of top 25 object categories showing their associ￾ated orientation tasks covering each of the seven distinct orientation reasoning tasks. These qualita￾tive illustrations shed light on the nuanced challenges faced by state-of-the-art Multimodal Large Language Models (MLLMs), beyond aggregate metrics. Each example is carefully chosen to represent either a canonical failure or, in… view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of fine-grained object categories mapped to broader semantic classes. Most categories fall under Vehicle, Miscellaneous, and Animals, reflecting com￾mon object types in the evaluated datasets. imbue models with a deeper, more structured understanding of object orienta￾tion dynamics. A.5 Enumerated List of Questions Coarse Questions – Q1 - View Parallelism: Determine which way Object A’s front … view at source ↗
Figure 10
Figure 10. Figure 10: An example VQA from the View Parallelism orientation task illustrating a failure case on a coarse-level question in DORI, with a prediction from Gemini 1.5 Pro. Many models failed this question, highlighting a common challenge among MLLMs in understanding fundamental scene geometry. Specifically, in reasoning whether an object’s front-facing surface is oriented toward, away from, or perpendicular to the c… view at source ↗
Figure 11
Figure 11. Figure 11: An example VQA from the View Parallelism orientation task highlighting a failure case on a granular-level question in DORI, with a response from Gemini 2.0 Flash. Although the model confidently selects an answer, its reasoning reflects a fundamental misunderstanding of the object’s orientation relative to the camera. While the ground truth indicates the object is turned between 135° and 180° away, the mod… view at source ↗
Figure 12
Figure 12. Figure 12: A coarse-level VQA from the Directional Facing task in DORI, illustrating a failure case by GPT-4-1. While the ground truth indicates a leftward orientation, the model predicts ‘away,’ misreading body posture and head direction. Notably, 13 out of 15 models failed this question, underscoring a widespread struggle with coarse directional inference. Prompt: TASK: Determine which direction Object A's front-f… view at source ↗
Figure 13
Figure 13. Figure 13: A coarse-level VQA from the Directional Facing task in DORI, showing a failure case by Gemini-2.0-Flash. While the ground truth identifies the giraffe’s orien￾tation as rightward, the model incorrectly infers a leftward direction. This highlights the difficulty in interpreting animal pose and orientation cues, a challenge shared by the majority of models in this example [PITH_FULL_IMAGE:figures/full_fig_… view at source ↗
Figure 14
Figure 14. Figure 14: A granular-level VQA from the Single-Axis Rotation task in DORI, fea￾turing a case where GPT-4o selects the correct answer (180° rotation) but provides flawed reasoning. While the model concludes with the correct choice, its explanation suggests uncertainty due to perceived ambiguity in object orientation. This mismatch between prediction and rationale underscores gaps in visual reasoning, with only 2 out… view at source ↗
Figure 15
Figure 15. Figure 15: A granular-level VQA from the Single-Axis Rotation in DORI, showing a failure case with a prediction from Gemini 2.0 Flash. While the correct answer is a 180° rotation, the model incorrectly selects 90° (option B), suggesting a misjudgment of the object’s current orientation relative to the camera. This highlights challenges models face in reasoning about precise rotational alignment. Prompt: TASK: Determ… view at source ↗
Figure 16
Figure 16. Figure 16: A coarse-level VQA from the Compound Rotation task in DORI, with a correct prediction from Gemini 2.0 Flash. The model identifies a horizontal axis rotation, but its reasoning overstates the change, describing the car as ‘upside down’ when it is only partially inverted [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: A granular-level VQA from the Compound Rotation task in DORI showing a failure case from Gemini 2.0 Flash. While the model selects the correct option, its reasoning misidentifies the vertical rotation, describing a 180° flip when the ground truth indicates a 90° transformation. Prompt: TASK: Determine if objects A and B are facing each other from their own perspectives. CONTEXT: Objects A and B are marked… view at source ↗
Figure 18
Figure 18. Figure 18: A coarse-level VQA from the Inter-object Direction task in DORI show￾ing a failure case with GPT-4-1. Although the correct answer is ’Partially facing the same direction,’ the model incorrectly selects ’Partially facing opposite directions,’ misjudging the relative orientations. Notably, 13 out of 15 models failed this example, highlighting a shared difficulty in reasoning about partially aligned object d… view at source ↗
Figure 19
Figure 19. Figure 19: A granular-level VQA from the Inter-object Direction task in DORI illus￾trating a failure case with Gemini 1.5 Pro. While the ground truth indicates a required clockwise rotation between 46° and 90° for alignment, the model underestimates this angle. Its reasoning misinterprets the spatial alignment between the chair and sofa. All 15 models failed this example, underscoring the challenge of inter-object d… view at source ↗
Figure 20
Figure 20. Figure 20: Viewer-Scene Direction. Example success case on a granular-level DORI question with Gemini-2.0-Flash. While the model correctly selects the 90-degree rota￾tion, it inaccurately describes the directional shift as ’right’ to ’bottom’ instead of the more precise ’bottom-right’ to ’bottom-left,’ reflecting a partial misunderstanding of fine-grained orientation. Prompt: TASK: Determine how many degrees clockwi… view at source ↗
Figure 21
Figure 21. Figure 21: Viewer-Scene Direction. Example failure case on a granular-level DORI question with GPT-4-1. Although the model selects the correct answer of 180 degrees, its reasoning mistakenly describes a 90-degree rotation, highlighting a disconnect be￾tween answer selection and spatial understanding. Notably, 13 out of 15 models failed this case [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Canonical Orientation. Example failure case on a coarse-level DORI ques￾tion with Gemini-2.0-Flash. While the ground truth is ’Cannot be determined,’ the model incorrectly selects a definitive orientation. Its reasoning contradicts the inher￾ent ambiguity it acknowledges, exposing uncertainty in handling objects with no fixed canonical pose [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: An example of clockwise and counter-clockwise rotations for the Inter-object direction perception task, we utilized samples from the NOCS REAL dataset [PITH_FULL_IMAGE:figures/full_fig_p031_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: examines the relationship between answer set size and model accuracy. We observe a peak in performance at 3-option questions (42.5%), with accuracy declining markedly as the number of candidate answers increases. At 5 and 6 options, performance drops to 26% and 19%, respectively, with a further decline to 13% at 9 options, and a minimum of 6% at 16 options. This degradation illustrates the difficulty MLLM… view at source ↗
Figure 25
Figure 25. Figure 25: Performance of 18 leading Multimodal Large Language Models (MLLMs) across data types (a) and annotation granularity levels (b). Gemini and GPT models lead overall. Models perform noticeably better on simulated datasets and coarse-type questions, revealing a gap in generalizing to natural images and more fine-grained questions. the different categories with Food being the one that appears to standout. In t… view at source ↗
Figure 26
Figure 26. Figure 26: Model accuracy for the 18 models across the broad categories described in SectionSec. A.3 with most categories tending to performing similarly except for Food which tends to have lower performances except for GPT-4o. Which highlights a gap for orientation related questions related to food. B.3 Task Complexity Hierarchy Our multi-dimensional analysis reveals a clear hierarchy of difficulty in orienta￾tion … view at source ↗
Figure 27
Figure 27. Figure 27: Model accuracy for the 18 models across the data-sources present in DORI. We see that overall the closed source models tend to perform better than the open source models. With more real and smaller datasets like NOCS_Real and Objectron proving to be challenging for most of the models [PITH_FULL_IMAGE:figures/full_fig_p036_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Performance of 18 leading Multimodal Large Language Models (MLLMs) different levels of our questions hierachy with (left) being the being the four core dimensions of orientation and (right) being the different questions present in DORI [PITH_FULL_IMAGE:figures/full_fig_p037_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on View Parallelism task. See discussion Sec.Sec. C.3. tions. For each model and question type combination, the formula error = std_accuracy/sqrt(seed_count) was applied. We used 3 different seeds: [42, 1998, 107983]. Looking at the View Parallelism task ( [PITH_FULL_IMAGE:figures/full_fig_p043_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Directional Facing task. See discussion Sec.Sec. C.3. dropping below 10% for all models. LLaVA-Next-8B shows the highest coarse performance but extremely low granular performance. The error bars for gran￾ular questions are relatively tight (except in LLaVa-13B-base), suggesting that models consistently struggle with this ta… view at source ↗
Figure 31
Figure 31. Figure 31: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Single-axis Rotation task. See discussion Sec.Sec. C.3. tectures encode rotation perception in fundamentally different ways. This task exposes fundamental inconsistencies in how current MLLMs process orientation changes, suggesting that rotation tracking may rely on different computational mechanisms than static orientation… view at source ↗
Figure 32
Figure 32. Figure 32: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Compound Rotation task. See discussion Sec.Sec. C.3. racy but also reduces initialization sensitivity. Finally, the performance inversions observed between coarse and granular questions for some models highlight the disconnect between categorical and precise metric orientation understanding, a fundamental challenge that per… view at source ↗
Figure 33
Figure 33. Figure 33: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Inter-object Direction task. See discussion Sec.Sec. C.3. Armed with these splits, we compared performance on LLava-13B, LLava￾34B, Yi-VL-6B and Deepseek-7B, shown below. Interestingly, the OOP results are generally better than ALL, which seems counterintuitive at first glance. However, we suspect that this is an effect of … view at source ↗
Figure 34
Figure 34. Figure 34: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Viewer-scene Direction task. See discussion Sec.Sec. C.3. deep-1.3B-chat llava-13B-baseqwen-3B-instr deep-7B-basedeep-7B-chat llava-next-8B deep-1.3B-base Model 0 20 40 60 80 100 Accuracy (%) 16.1 11.6 11.4 10.9 10.8 9.6 4.5 18.2 14.9 15.9 12.3 11.7 13.6 5.2 Question q7 Granular Performance: Standard vs Soft Accuracy Standa… view at source ↗
Figure 35
Figure 35. Figure 35: Comparing the mean and standard deviation of soft vs. standard (hard) ac￾curacy on Inter-object Direction task. See discussion Sec.Sec. C.3 [PITH_FULL_IMAGE:figures/full_fig_p048_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Relative gain from using Soft accuracy rather than Standard (hard) accuracy per question. SeeSec. C.3 for discussion deep-7B-chat qwen-3B-instr llava-13B-base deep-1.3B-chat deep-1.3B-base llava-next-8B deep-7B-base Model 0 20 40 60 80 100 Accuracy (%) 58.4 56.3 56.3 39.6 37.3 36.2 31.3 29.7 30.6 26.7 24.9 24.3 25.5 25.4 Question q1 Performance: Coarse vs Granular Coarse Granular [PITH_FULL_IMAGE:figures… view at source ↗
Figure 37
Figure 37. Figure 37: Mean and Standard Deviation of various models on View Parallelism task [PITH_FULL_IMAGE:figures/full_fig_p049_37.png] view at source ↗
Figure 38
Figure 38. Figure 38: Mean and Standard Deviation of various models on Directional Facing task deep-7B-chat llava-13B-base deep-1.3B-base deep-1.3B-chatdeep-7B-base llava-next-8B qwen-3B-instr Model 0 20 40 60 80 100 Accuracy (%) 27.4 22.0 21.0 20.4 20.3 20.1 18.6 18.2 16.4 16.9 16.8 11.9 17.8 17.7 Question q3 Performance: Coarse vs Granular Coarse Granular [PITH_FULL_IMAGE:figures/full_fig_p050_38.png] view at source ↗
Figure 39
Figure 39. Figure 39: Mean and Standard Deviation of various models on Single-axis Rotation task [PITH_FULL_IMAGE:figures/full_fig_p050_39.png] view at source ↗
Figure 40
Figure 40. Figure 40: Mean and Standard Deviation of various models on Compound Rotation task deep-7B-chat deep-1.3B-base deep-1.3B-chat llava-next-8B qwen-3B-instr llava-13B-basedeep-7B-base Model 0 20 40 60 80 100 Accuracy (%) 39.2 20.2 17.2 16.6 16.3 11.1 8.4 12.7 10.9 10.9 14.1 14.2 14.0 7.8 Question q5 Performance: Coarse vs Granular Coarse Granular [PITH_FULL_IMAGE:figures/full_fig_p051_40.png] view at source ↗
Figure 41
Figure 41. Figure 41: Mean and Standard Deviation of various models on Inter-object direction [PITH_FULL_IMAGE:figures/full_fig_p051_41.png] view at source ↗
Figure 42
Figure 42. Figure 42: Mean and Standard Deviation of various models on Viewer-Scene direction task. deep-7B-base deep-1.3B-chat llava-13B-base llava-next-8B deep-7B-chat qwen-3B-instr deep-1.3B-base Model 0 20 40 60 80 100 Accuracy (%) 29.1 27.8 15.1 10.7 10.5 7.5 0.1 10.9 16.1 11.6 9.6 10.8 11.4 4.5 Question q7 Performance: Coarse vs Granular Coarse Granular [PITH_FULL_IMAGE:figures/full_fig_p052_42.png] view at source ↗
Figure 43
Figure 43. Figure 43: Mean and Standard Deviation of various models on the Canonical Orientation task [PITH_FULL_IMAGE:figures/full_fig_p052_43.png] view at source ↗
Figure 44
Figure 44. Figure 44: An example of the high level instructions shown for the Human Evaluation D Human Evaluation Interface An example of the high level instructions shown for this task can be seen in [PITH_FULL_IMAGE:figures/full_fig_p054_44.png] view at source ↗
Figure 45
Figure 45. Figure 45: An example of a sample for the coarse-level VQA for the Directional Facing task in DORI shown for the Human Evaluation [PITH_FULL_IMAGE:figures/full_fig_p055_45.png] view at source ↗
Figure 46
Figure 46. Figure 46: An example of a sample for the coarse-level VQA for the Viewer-Scenen Direction task in DORI shown for the Human Evaluation [PITH_FULL_IMAGE:figures/full_fig_p056_46.png] view at source ↗
read the original abstract

Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights on how multi-modal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations - suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the DORI benchmark to evaluate object orientation understanding in multi-modal large language models (MLLMs). It claims that even the best models achieve only 54.2% accuracy on coarse orientation tasks and 33.0% on granular judgments, with performance worsening for tasks involving reference frame shifts or compound rotations. The benchmark is constructed from tasks curated from 11 datasets spanning 67 object categories in both synthetic and real-world scenarios, assessing four dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding.

Significance. If the benchmark tasks successfully isolate orientation comprehension, the findings would highlight important gaps in MLLMs' internal 3D spatial representations and motivate dedicated orientation mechanisms. The public release of the DORI dataset supports potential reproducibility and follow-up work in robotics and scene understanding.

major comments (2)
  1. [Abstract] Abstract: The claim that DORI tasks isolate pure orientation comprehension from positional relationships and general scene understanding lacks any methodological detail on controls for confounds (e.g., object position, viewpoint statistics, lighting, or canonical shape cues). This is load-bearing for the central claim because the reported accuracy drops (54.2% coarse, 33.0% granular) and deterioration on reference-frame shifts cannot be attributed specifically to orientation deficits without such isolation.
  2. [Abstract] Abstract: No information is provided on task validation, inter-annotator agreement, or statistical significance testing for the accuracy figures and performance trends across the 15 models. This undermines the reliability of the headline conclusion that models exhibit systematic orientation failures.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'carefully curated tasks' is used without elaboration on the curation criteria or balancing procedures; expanding this in the full manuscript would improve clarity of the benchmark construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that DORI tasks isolate pure orientation comprehension from positional relationships and general scene understanding lacks any methodological detail on controls for confounds (e.g., object position, viewpoint statistics, lighting, or canonical shape cues). This is load-bearing for the central claim because the reported accuracy drops (54.2% coarse, 33.0% granular) and deterioration on reference-frame shifts cannot be attributed specifically to orientation deficits without such isolation.

    Authors: We agree that the abstract, due to length constraints, does not explicitly detail confound controls. The manuscript describes curation from 11 datasets across 67 categories in synthetic and real-world settings to emphasize object-centric views and reduce positional confounds, with varied viewpoints and lighting conditions to limit shape and illumination biases. To strengthen the presentation, we will revise the abstract to include a brief summary of these isolation strategies. revision: yes

  2. Referee: [Abstract] Abstract: No information is provided on task validation, inter-annotator agreement, or statistical significance testing for the accuracy figures and performance trends across the 15 models. This undermines the reliability of the headline conclusion that models exhibit systematic orientation failures.

    Authors: The abstract prioritizes headline results and does not cover validation procedures. The full manuscript includes task validation steps, inter-annotator agreement metrics, and statistical significance testing for the reported accuracies and trends. We will revise the abstract to add a concise reference to these validation elements for improved transparency. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark results derived directly from task evaluations

full rationale

The paper introduces DORI as a new benchmark with tasks curated from 11 public datasets across 67 categories. Reported accuracies (54.2% coarse, 33.0% granular) are direct model evaluations on these tasks, with no equations, fitted parameters, or self-citations appearing in the abstract. Conclusions about orientation limitations follow from the performance numbers without any reduction to inputs by construction, self-definitional loops, or renamed known results. The derivation chain is self-contained as an empirical evaluation framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility into explicit parameters or axioms; the central structural assumption is that the four listed dimensions cleanly separate orientation from other visual capabilities.

axioms (1)
  • domain assumption The four dimensions of frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding represent distinct and isolable aspects of object orientation comprehension.
    Used to define the structure of the benchmark tasks.

pith-pipeline@v0.9.0 · 5786 in / 1126 out tokens · 46300 ms · 2026-05-19T12:34:06.749436+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding... even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Drawing on established frameworks in cognitive neuroscience, DORI decomposes object orientation understanding into the four dimensions... each corresponding to a separable perceptual or reasoning mechanism

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

    cs.CV 2026-05 accept novelty 7.0

    WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.

  2. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M.: Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7822–7831 (2021)

  2. [2]

    AI, ., :, Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., Yu, K., Liu, P., Liu, Q., Yue, S., Yang, S., Yang, S., Yu, T., Xie, W., Huang, W., Hu, X., Ren, X., Niu, X., Nie, P., Xu, Y., Liu, Y., Wang, Y., Cai, Y., Gu, Z., Liu, Z., Dai, Z.: Yi: Open foundation models by 01.ai (2024)

  3. [3]

    Artificial Intelligence303, 103637 (2022).https://doi.org/10.1016/j.artint.2021.103637

    Alomari, M., Li, F., Hogg, D., Cohn, A.G.: Online perceptual learning and natural language acquisition for autonomous robots. Artificial Intelligence303, 103637 (2022).https://doi.org/10.1016/j.artint.2021.103637

  4. [4]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: Scanqa: 3d question answer- ing for spatial scene understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  5. [5]

    Vision Research51, 470–478 (2011).https://doi

    Baker, T.J., Norcia, A.M., Candy, T.R.: Orientation tuning in the visual cortex of 3-month-old human infants. Vision Research51, 470–478 (2011).https://doi. org/10.1016/j.visres.2011.01.003

  6. [6]

    Ballaz, C., Boutsen, L., Peyrin, C., Humphreys, G.W., Marendaz, C.: Visual searchforobjectorientationcanbemodulatedbycanonicalorientation.Journalof Experimental Psychology: Human Perception and Performance31(1), 20 (2005)

  7. [7]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)

  8. [8]

    Advances in child development and behavior25, 157–199 (1994)

    Blades, M., Spencer, C.: The development of children’s ability to use spatial rep- resentations. Advances in child development and behavior25, 157–199 (1994)

  9. [9]

    Vision Research45(25), 3169–3179 (2005)

    Braddick, O., Birtles, D., Wattam-Bell, J., Atkinson, J.: Motion- and orientation- specific cortical responses in infancy. Vision Research45(25), 3169–3179 (2005). https://doi.org/https://doi.org/10.1016/j.visres.2005.07.021,https: //www.sciencedirect.com/science/article/pii/S0042698905003846

  10. [10]

    Cognitive psychology105, 9–38 (2018)

    Bramley, N.R., Gerstenberg, T., Tenenbaum, J.B., Gureckis, T.M.: Intuitive ex- perimentation in the physical world. Cognitive psychology105, 9–38 (2018)

  11. [11]

    In: NERA Conference Proceedings 2024 (2024)

    Castle, C.: Using design thinking to create human-centered assessments. In: NERA Conference Proceedings 2024 (2024)

  12. [12]

    bioRxiv pp

    Chavez, R.S., Guthrie, T.D., Kapustka, J.M.: Independent allocentric and ego- centric reference frame encoding of person knowledge within the brains of social groups. bioRxiv pp. 2023–09 (2023)

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14455–14465 (June 2024)

  14. [14]

    Machines11(2), 275 (2023)

    Chen, Y.L., Cai, Y.R., Cheng, M.Y.: Vision-based robotic object grasping—a deep reinforcement learning approach. Machines11(2), 275 (2023)

  15. [15]

    In: NeurIPS (2024)

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: NeurIPS (2024)

  16. [16]

    Advances in Neural Information Processing Systems37, 135062–135093 (2025) 61

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2025) 61

  17. [17]

    KI-97: Advances in Artificial Intelligence pp

    Cohn, A.G.: Qualitative spatial representation and reasoning techniques. KI-97: Advances in Artificial Intelligence pp. 1–30 (1997).https://doi.org/10.1007/ 3540634932_1

  18. [18]

    IEEE Transactions on Cybernetics53(3), 1682–1698 (2021)

    Cong, Y., Chen, R., Ma, B., Liu, H., Hou, D., Yang, C.: A comprehensive study of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics53(3), 1682–1698 (2021)

  19. [19]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)

    Corsetti, J., Boscaini, D., Oh, C., Cavallaro, A., Poiesi, F.: Open-vocabulary ob- ject 6d pose estimation. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 18071–18080 (June 2024)

  21. [21]

    Creativity Research Journal35, 23–32 (2022).https://doi.org/10.1080/10400419.2022.2049532

    Cortes, R.A., Colaizzi, G.A., Dyke, E.L., Peterson, E.G., Walker, D.L., Kolvoord, R.A., Uttal, D.H., Green, A.E.: Individual differences in parietal and premotor activity during spatial cognition predict figural creativity. Creativity Research Journal35, 23–32 (2022).https://doi.org/10.1080/10400419.2022.2049532

  22. [22]

    Comprehensive Physiology8(4), 1575 (2018)

    Delhaye, B.P., Long, K.H., Bensmaia, S.J.: Neural basis of touch and propriocep- tion in primate cortex. Comprehensive Physiology8(4), 1575 (2018)

  23. [23]

    CoRR abs/2410.09312(2024)

    Deng, Q., Deb, O., Patel, A., Rupprecht, C., Torr, P., Trigoni, N., Markham, A.: Towards multi-modal animal pose estimation: An in-depth analysis. CoRR abs/2410.09312(2024)

  24. [24]

    In: 2020 IEEE International Conference on Robotics and Automation (ICRA)

    Deng,X.,Xiang,Y.,Mousavian,A.,Eppner,C.,Bretl,T.,Fox,D.:Self-supervised 6d object pose estimation for robot manipulation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 3665–3671. IEEE (2020)

  25. [25]

    Cerebral Cortex33(22), 11146–11156 (2023)

    Doganci, N., Iannotti, G.R., Coll, S.Y., Ptak, R.: How embodied is cognition? fmri and behavioral evidence for common neural resources underlying motor plan- ning and mental rotation of bodily stimuli. Cerebral Cortex33(22), 11146–11156 (2023)

  26. [26]

    Du, G., Wang, K., Lian, S., Zhao, K.: Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review. Artif. Intell. Rev.54(3), 1677–1734 (Mar 2021)

  27. [27]

    Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y

    Du, M., Wu, B., Li, Z., Huang, X., Wei, Z.: EmbSpatial-bench: Benchmark- ing spatial understanding for embodied tasks with large vision-language mod- els. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 346–355. Association for Computationa...

  28. [28]

    In: Proceedings of the European conference on computer vision (ECCV)

    Fabbri,M.,Lanzi,F.,Calderara,S.,Palazzi,A.,Vezzani,R.,Cucchiara,R.:Learn- ing to detect and track visible and occluded body joints in a virtual world. In: Proceedings of the European conference on computer vision (ECCV). pp. 430–446 (2018)

  29. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Chatpose: Chat- ting about 3d human pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2093–2103 (June 2024)

  30. [30]

    Trends in cognitive sciences18(10), 536–542 (2014)

    Frick, A., Möhring, W., Newcombe, N.S.: Development of mental transformation abilities. Trends in cognitive sciences18(10), 536–542 (2014)

  31. [31]

    In: ACM SIGGRAPH 2008 Papers

    Fu, H., Cohen-Or, D., Dror, G., Sheffer, A.: Upright orientation of man-made objects. In: ACM SIGGRAPH 2008 Papers. SIGGRAPH ’08, Association for 62 Computing Machinery, New York, NY, USA (2008).https://doi.org/10.1145/ 1399504.1360641,https://doi.org/10.1145/1399504.1360641

  32. [32]

    International Journal of Computer Vision129, 3313–3337 (2021)

    Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3d-future: 3d furniture shape with texture. International Journal of Computer Vision129, 3313–3337 (2021)

  33. [34]

    In: European Conference on Computer Vision

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)

  34. [35]

    IEEE Access (2024)

    Ganesan, M., Kandhasamy, S., Chokkalingam, B., Mihet-Popa, L.: A comprehen- sive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle. IEEE Access (2024)

  35. [36]

    Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., Sadigh, D.:Physicallygroundedvision-languagemodels for roboticmanipulation.In:2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 12462– 12469 (2024).https://doi.org/10.1109/ICRA57147.2024.10610090

  36. [37]

    Advances In Neural Information Processing Systems35, 31841– 31854 (2022)

    Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems35, 31841– 31854 (2022)

  37. [38]

    In: 2012 IEEE conference on computer vision and pattern recognition

    Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3354–3361. IEEE (2012)

  38. [39]

    Human brain mapping33(4), 895–908 (2012)

    Goble, D.J., Coxon, J.P., Van Impe, A., Geurts, M., Van Hecke, W., Sunaert, S., Wenderoth, N., Swinnen, S.P.: The neural basis of central proprioceptive process- ing in older versus younger adults: an important sensory role for right putamen. Human brain mapping33(4), 895–908 (2012)

  39. [40]

    Nature Neu- roscience10, 512–522 (2007).https://doi.org/10.1038/nn1865

    Golarai, G., Ghahremani, D.G., Whitfield-Gabrieli, S., Reiss, A.L., Eberhardt, J.L., Gabrieli, J.D.E., Grill-Spector, K.: Differential development of high-level visual cortex correlates with category-specific recognition memory. Nature Neu- roscience10, 512–522 (2007).https://doi.org/10.1038/nn1865

  40. [41]

    In: International Conference on Soft Computing and Pattern Recognition

    Gontier, C., Jordan, J., Petrovici, M.A.: Delaunay: A dataset of abstract art for psychophysical and machine learning research. In: International Conference on Soft Computing and Pattern Recognition. pp. 134–143. Springer (2023)

  41. [42]

    Journal of cognitive neuroscience22(12), 2836–2849 (2010)

    Gramann, K., Onton, J., Riccobon, D., Mueller, H.J., Bardins, S., Makeig, S.: Human brain dynamics accompanying use of egocentric and allocentric reference frames during navigation. Journal of cognitive neuroscience22(12), 2836–2849 (2010)

  42. [43]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA) (2024)

    Guan, T., Yang, Y., Cheng, H., Lin, M., Kim, R., Madhivanan, R., Sen, A., Manocha, D.: LOC-ZSON: language-driven object-centric zero-shot object re- trieval and navigation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA) (2024)

  43. [44]

    NSF Award Num- ber 1729205

    Guibas, L.J.: Collaborative Research: CI-P: ShapeNet: An Information-Rich 3D Model Repository for Graphics, Vision and Robotics Research. NSF Award Num- ber 1729205. Directorate for Computer and Information Science and Engineering, Division Of Computer and Network Systems. 2017. (Sep 2017) 63

  44. [45]

    In: Proceedings of the 33rd annual con- ference of the cognitive science society

    Hamrick, J., Battaglia, P., Tenenbaum, J.B.: Internal physics models guide proba- bilistic judgments about object dynamics. In: Proceedings of the 33rd annual con- ference of the cognitive science society. vol. 2. Cognitive Science Society Austin, TX (2011)

  45. [46]

    Psychonomic Bulletin & Review31(4), 1503–1515 (2024)

    Harris, I.M.: Interpreting the orientation of objects: A cross-disciplinary review. Psychonomic Bulletin & Review31(4), 1503–1515 (2024)

  46. [47]

    arXiv preprint arXiv:2412.05127 (2024)

    Hewing, M., Leinhos, V.: The prompt canvas: a literature-based practitioner guide for creating effective prompts in large language models. arXiv preprint arXiv:2412.05127 (2024)

  47. [48]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  48. [49]

    In: Ku, L.W., Martins, A., Srikumar, V

    Huang, W., Liu, H., Guo, M., Gong, N.: Visual hallucinations of multi-modal large language models. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 9614–9631. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024).https://doi. org/10.18653/v1/2024.findings-acl.573,http...

  49. [50]

    Foundations and trends®in computer graphics and vision12(1–3), 1–308 (2020)

    Janai, J., Güney, F., Behl, A., Geiger, A., et al.: Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and trends®in computer graphics and vision12(1–3), 1–308 (2020)

  50. [51]

    Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: Omnis- patial: Towards comprehensive spatial reasoning benchmark for vision language models.In:TheFourteenthInternationalConferenceonLearningRepresentations (2026),https://openreview.net/forum?id=6nZKT2rL0H

  51. [52]

    Transactions on Machine Learning Research2024(2024),https://openreview.net/forum?id=skLtdUVaJa

    Jiang, D., He, X., Zeng, H., Wei, C., Ku, M.W., Liu, Q., Chen, W.: Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research2024(2024),https://openreview.net/forum?id=skLtdUVaJa

  52. [53]

    In: 2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)

    Jung, J.H., Kim, E.T., Kim, S., Lee, J.H., Kim, B., Chang, B.: Is ‘right’right? enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. In: 2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 14257–14267. IEEE (2025)

  53. [54]

    Journal of Vi- sion23(10), 9–9 (2023)

    Kallmayer, A., Võ, M.L.H., Draschkow, D.: Viewpoint dependence and scene con- text effects generalize to depth rotated three-dimensional objects. Journal of Vi- sion23(10), 9–9 (2023)

  54. [55]

    Neurosci5, 713–728 (2024).https://doi.org/10.3390/neurosci5040050

    Kostakos, K., Pliakopanou, A., Meimaridis, V., Galanou, O., Anagnostou, A., Ser- tidou, D., Katis, P., Anastasiou, P., Katsoulidis, K., Lykogiorgos, Y., Mytilinaios, D., Katsenos, A.P., Simos, Y.V., Bellos, S., Konitsiotis, S., Peschos, D., Tsamis, K.I.: Development of spatial memory: A behavioral study. Neurosci5, 713–728 (2024).https://doi.org/10.3390/n...

  55. [56]

    Visual Cognition9(1-2), 248–264 (2002).https://doi.org/ 10.1080/13506280143000421

    Kourtzi, Z., Nakayama, K.: Distinct mechanisms for the representation of moving and static objects. Visual Cognition9(1-2), 248–264 (2002).https://doi.org/ 10.1080/13506280143000421

  56. [57]

    In: European Conference on Computer Vision

    Krishnan, A., Kundu, A., Maninis, K.K., Hays, J., Brown, M.: Omninocs: A uni- fied nocs dataset and model for 3d lifting of 2d objects. In: European Conference on Computer Vision. pp. 127–145. Springer (2024)

  57. [58]

    In: Rambow, O., Wan- ner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S

    Lei, X., Yang, Z., Chen, X., Li, P., Liu, Y.: Scaffolding coordinates to promote vision-language coordination in large multi-modal models. In: Rambow, O., Wan- ner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Pro- ceedings of the 31st International Conference on Computational Linguistics. pp. 64 2886–2903. Association for Comp...

  58. [59]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

    Li, B., Ge, Y., Chen, Y., Ge, Y., Zhang, R., Shan, Y.: Seed-bench-2-plus: Bench- marking multimodal large language models with text-rich visual comprehension. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

  59. [60]

    In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence

    Li, F., Hogg, D.C., Cohn, A.G.: Reframing spatial reasoning evaluation in lan- guage models: a real-world simulation benchmark for qualitative reasoning. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. IJCAI ’24 (2024).https://doi.org/10.24963/ijcai.2024/701, https://doi.org/10.24963/ijcai.2024/701

  60. [61]

    Scientific reports9(1), 13727 (2019)

    Li, H., Liu, N., Li, Y., Weidner, R., Fink, G.R., Chen, Q.: The simon effect based on allocentric and egocentric reference frame: Common and specific neural correlates. Scientific reports9(1), 13727 (2019)

  61. [62]

    In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

    Li, S., Koh, P.W., Du, S.S.: Exploring how generative MLLMs perceive more than CLIP with the same vision encoder. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers). pp. 10101– 10119. Association for Computational Linguistics, Vienna...

  62. [63]

    In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)

  63. [64]

    Lin, X., Dai, Z., Verma, A., Ng, S.K., Jaillet, P., Low, B.K.H.: Prompt op- timization with human feedback (2025),https://openreview.net/forum?id= UW0zetsx8X

  64. [65]

    In: ECCV(62).pp.36–55(2024),https://doi.org/10.1007/978-3-031-73033-7_3

    Lin,Z.,Liu,D.,Zhang,R.,Gao,P.,Qiu,L.,Xiao,H.,Qiu,H.,Shao,W.,Chen,K., Han,J.,Huang,S.,Zhang,Y.,He,X.,Qiao,Y.,Li,H.:Sphinx:Amixerofweights, visual embeddings and image scales for multi-modal large language models. In: ECCV(62).pp.36–55(2024),https://doi.org/10.1007/978-3-031-73033-7_3

  65. [66]

    Current Biology19(17), 1458–1462 (2009)

    Ling, S., Pearson, J., Blake, R.: Dissociation of neural mechanisms underlying orientation processing in humans. Current Biology19(17), 1458–1462 (2009)

  66. [67]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Liu,H.,Li,C.,Li,Y.,Lee,Y.J.:Improvedbaselineswithvisualinstructiontuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26296–26306 (June 2024)

  67. [68]

    io/blog/2024-01-30-llava-next/

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https://llava-vl.github. io/blog/2024-01-30-llava-next/

  68. [69]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)

  69. [70]

    Multimedia Tools and Applications83(3), 6909–6924 (2024)

    Liu, R., Yu, Z., Fan, Q., Sun, Q., Jiang, Z.: The improved method in fabric image classification using convolutional neural network. Multimedia Tools and Applications83(3), 6909–6924 (2024)

  70. [71]

    (eds.) Computer Vision – ECCV 2024

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., Lin, D.: Mmbench: Is your multi-modal model an all- around player? In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 216–233. Springer Nature Switzerland, Cham (2025) 65

  71. [72]

    Journal of Cognition6(2023).https://doi.org/ 10.5334/joc.321

    Loy, J.E., Demberg, V.: Individual differences in spatial orientation modulate perspective taking in listeners. Journal of Cognition6(2023).https://doi.org/ 10.5334/joc.321

  72. [73]

    In: International Conference on Computer Vision (ICCV) (2025)

    Ma, W., Chen, H., Zhang, G., Chou, Y.C., de Melo, C.M., Yuille, A.: 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In: International Conference on Computer Vision (ICCV) (2025)

  73. [74]

    The MIT Press, Cambridge, 1st ed

    Mallot, H.A.: From geometry to behavior : an introduction to spatial cognition. The MIT Press, Cambridge, 1st ed. edn. (2023)

  74. [75]

    Advances in neural information processing systems25(2012)

    Mezuman, E., Weiss, Y.: Learning about canonical views from internet image collections. Advances in neural information processing systems25(2012)

  75. [76]

    In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cot- terell, R., Chakraborty, T., Zhou, Y

    Mirzaee, R., Rajaby Faghihi, H., Ning, Q., Kordjamshidi, P.: SPARTQA: A textual question answering benchmark for spatial reasoning. In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cot- terell, R., Chakraborty, T., Zhou, Y. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Associati...

  76. [77]

    In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems

    Mu, Y., Zhang, Q., Hu, M., Wang, W., Ding, M., Jin, J., Wang, B., Dai, J., Qiao, Y., Luo, P.: Embodiedgpt: vision-language pre-training via embodied chain of thought. In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

  77. [78]

    Experimental Psychology68, 41–48 (2021).https://doi.org/10.1027/1618-3169/a000505

    Muto, H.: Correlational evidence for the role of spatial perspective-taking ability in the mental rotation of human-like objects. Experimental Psychology68, 41–48 (2021).https://doi.org/10.1027/1618-3169/a000505

  78. [79]

    neelgajare: Rock images (2022),https : / / www . kaggle . com / datasets / neelgajare/rocks-dataset[Accessed: (01-29-2025)]

  79. [80]

    Deep learning for early warning signals of tipping points

    Peer, M., Salomon, R., Goldberg, I., Blanke, O., Arzy, S.: Brain system for men- tal orientation in space, time, and person. Proceedings of the National Academy of Sciences112(35), 11072–11077 (2015).https://doi.org/10.1073/pnas. 1504242112,https://www.pnas.org/doi/abs/10.1073/pnas.1504242112

  80. [81]

    In: Annual Meeting of the As- sociation for Computational Linguistics (2024),https://api.semanticscholar

    Pei, J., Viola, I., Huang, H., Wang, J., Ahsan, M., Ye, F., Jiang, Y., Sai, Y., Wang, D., Chen, Z., Ren, P., César, P.: Autonomous workflow for multimodal fine- grained training assistants towards mixed reality. In: Annual Meeting of the As- sociation for Computational Linguistics (2024),https://api.semanticscholar. org/CorpusID:269982847

Showing first 80 references.