Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
Pith reviewed 2026-05-19 12:34 UTC · model grok-4.3
The pith
Multi-modal large language models fail to understand object orientations, scoring at most 54.2% on coarse tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DORI establishes that state-of-the-art vision-language models exhibit systematic limitations in orientation comprehension, achieving at most 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations, indicating deficiencies in their internal 3D spatial representations.
What carries the argument
DORI benchmark, which assesses four dimensions of orientation comprehension through tasks from 11 datasets across 67 categories in synthetic and real scenarios.
If this is right
- Robotic manipulation would gain from models that correctly track object rotations and alignments.
- Augmented reality interactions could place and manipulate objects more accurately with better orientation awareness.
- 3D scene reconstruction would improve if models could follow orientation changes across different viewpoints.
- Human-AI systems operating in physical spaces would function more reliably with correct orientation judgments.
Where Pith is reading between the lines
- Training models explicitly on rotation and reference-frame examples might reduce the observed errors.
- The same benchmark approach could help diagnose other spatial reasoning gaps beyond orientation.
- Fixing these issues may require new internal representations that treat 3D rotations as a core capability.
Load-bearing premise
The curated tasks drawn from 11 datasets across 67 categories successfully isolate pure orientation comprehension without conflating it with positional relationships or general scene understanding.
What would settle it
A model achieving substantially higher accuracy, such as above 80 percent, on the granular orientation judgment tasks in DORI would indicate that the reported systematic failures are not present.
Figures
read the original abstract
Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights on how multi-modal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations - suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the DORI benchmark to evaluate object orientation understanding in multi-modal large language models (MLLMs). It claims that even the best models achieve only 54.2% accuracy on coarse orientation tasks and 33.0% on granular judgments, with performance worsening for tasks involving reference frame shifts or compound rotations. The benchmark is constructed from tasks curated from 11 datasets spanning 67 object categories in both synthetic and real-world scenarios, assessing four dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding.
Significance. If the benchmark tasks successfully isolate orientation comprehension, the findings would highlight important gaps in MLLMs' internal 3D spatial representations and motivate dedicated orientation mechanisms. The public release of the DORI dataset supports potential reproducibility and follow-up work in robotics and scene understanding.
major comments (2)
- [Abstract] Abstract: The claim that DORI tasks isolate pure orientation comprehension from positional relationships and general scene understanding lacks any methodological detail on controls for confounds (e.g., object position, viewpoint statistics, lighting, or canonical shape cues). This is load-bearing for the central claim because the reported accuracy drops (54.2% coarse, 33.0% granular) and deterioration on reference-frame shifts cannot be attributed specifically to orientation deficits without such isolation.
- [Abstract] Abstract: No information is provided on task validation, inter-annotator agreement, or statistical significance testing for the accuracy figures and performance trends across the 15 models. This undermines the reliability of the headline conclusion that models exhibit systematic orientation failures.
minor comments (1)
- [Abstract] Abstract: The phrase 'carefully curated tasks' is used without elaboration on the curation criteria or balancing procedures; expanding this in the full manuscript would improve clarity of the benchmark construction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that DORI tasks isolate pure orientation comprehension from positional relationships and general scene understanding lacks any methodological detail on controls for confounds (e.g., object position, viewpoint statistics, lighting, or canonical shape cues). This is load-bearing for the central claim because the reported accuracy drops (54.2% coarse, 33.0% granular) and deterioration on reference-frame shifts cannot be attributed specifically to orientation deficits without such isolation.
Authors: We agree that the abstract, due to length constraints, does not explicitly detail confound controls. The manuscript describes curation from 11 datasets across 67 categories in synthetic and real-world settings to emphasize object-centric views and reduce positional confounds, with varied viewpoints and lighting conditions to limit shape and illumination biases. To strengthen the presentation, we will revise the abstract to include a brief summary of these isolation strategies. revision: yes
-
Referee: [Abstract] Abstract: No information is provided on task validation, inter-annotator agreement, or statistical significance testing for the accuracy figures and performance trends across the 15 models. This undermines the reliability of the headline conclusion that models exhibit systematic orientation failures.
Authors: The abstract prioritizes headline results and does not cover validation procedures. The full manuscript includes task validation steps, inter-annotator agreement metrics, and statistical significance testing for the reported accuracies and trends. We will revise the abstract to add a concise reference to these validation elements for improved transparency. revision: yes
Circularity Check
No circularity: benchmark results derived directly from task evaluations
full rationale
The paper introduces DORI as a new benchmark with tasks curated from 11 public datasets across 67 categories. Reported accuracies (54.2% coarse, 33.0% granular) are direct model evaluations on these tasks, with no equations, fitted parameters, or self-citations appearing in the abstract. Conclusions about orientation limitations follow from the performance numbers without any reduction to inputs by construction, self-definitional loops, or renamed known results. The derivation chain is self-contained as an empirical evaluation framework.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The four dimensions of frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding represent distinct and isolable aspects of object orientation comprehension.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding... even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Drawing on established frameworks in cognitive neuroscience, DORI decomposes object orientation understanding into the four dimensions... each corresponding to a separable perceptual or reasoning mechanism
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
-
Why MLLMs Struggle to Determine Object Orientations
Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M.: Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7822–7831 (2021)
work page 2021
-
[2]
AI, ., :, Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., Yu, K., Liu, P., Liu, Q., Yue, S., Yang, S., Yang, S., Yu, T., Xie, W., Huang, W., Hu, X., Ren, X., Niu, X., Nie, P., Xu, Y., Liu, Y., Wang, Y., Cai, Y., Gu, Z., Liu, Z., Dai, Z.: Yi: Open foundation models by 01.ai (2024)
work page 2024
-
[3]
Artificial Intelligence303, 103637 (2022).https://doi.org/10.1016/j.artint.2021.103637
Alomari, M., Li, F., Hogg, D., Cohn, A.G.: Online perceptual learning and natural language acquisition for autonomous robots. Artificial Intelligence303, 103637 (2022).https://doi.org/10.1016/j.artint.2021.103637
-
[4]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: Scanqa: 3d question answer- ing for spatial scene understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
work page 2022
-
[5]
Vision Research51, 470–478 (2011).https://doi
Baker, T.J., Norcia, A.M., Candy, T.R.: Orientation tuning in the visual cortex of 3-month-old human infants. Vision Research51, 470–478 (2011).https://doi. org/10.1016/j.visres.2011.01.003
-
[6]
Ballaz, C., Boutsen, L., Peyrin, C., Humphreys, G.W., Marendaz, C.: Visual searchforobjectorientationcanbemodulatedbycanonicalorientation.Journalof Experimental Psychology: Human Perception and Performance31(1), 20 (2005)
work page 2005
-
[7]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al.: Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Advances in child development and behavior25, 157–199 (1994)
Blades, M., Spencer, C.: The development of children’s ability to use spatial rep- resentations. Advances in child development and behavior25, 157–199 (1994)
work page 1994
-
[9]
Vision Research45(25), 3169–3179 (2005)
Braddick, O., Birtles, D., Wattam-Bell, J., Atkinson, J.: Motion- and orientation- specific cortical responses in infancy. Vision Research45(25), 3169–3179 (2005). https://doi.org/https://doi.org/10.1016/j.visres.2005.07.021,https: //www.sciencedirect.com/science/article/pii/S0042698905003846
-
[10]
Cognitive psychology105, 9–38 (2018)
Bramley, N.R., Gerstenberg, T., Tenenbaum, J.B., Gureckis, T.M.: Intuitive ex- perimentation in the physical world. Cognitive psychology105, 9–38 (2018)
work page 2018
-
[11]
In: NERA Conference Proceedings 2024 (2024)
Castle, C.: Using design thinking to create human-centered assessments. In: NERA Conference Proceedings 2024 (2024)
work page 2024
-
[12]
Chavez, R.S., Guthrie, T.D., Kapustka, J.M.: Independent allocentric and ego- centric reference frame encoding of person knowledge within the brains of social groups. bioRxiv pp. 2023–09 (2023)
work page 2023
-
[13]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14455–14465 (June 2024)
work page 2024
-
[14]
Chen, Y.L., Cai, Y.R., Cheng, M.Y.: Vision-based robotic object grasping—a deep reinforcement learning approach. Machines11(2), 275 (2023)
work page 2023
-
[15]
Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: NeurIPS (2024)
work page 2024
-
[16]
Advances in Neural Information Processing Systems37, 135062–135093 (2025) 61
Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems37, 135062–135093 (2025) 61
work page 2025
-
[17]
KI-97: Advances in Artificial Intelligence pp
Cohn, A.G.: Qualitative spatial representation and reasoning techniques. KI-97: Advances in Artificial Intelligence pp. 1–30 (1997).https://doi.org/10.1007/ 3540634932_1
work page 1997
-
[18]
IEEE Transactions on Cybernetics53(3), 1682–1698 (2021)
Cong, Y., Chen, R., Ma, B., Liu, H., Hou, D., Yang, C.: A comprehensive study of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics53(3), 1682–1698 (2021)
work page 2021
-
[19]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
work page 2016
-
[20]
In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)
Corsetti, J., Boscaini, D., Oh, C., Cavallaro, A., Poiesi, F.: Open-vocabulary ob- ject 6d pose estimation. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 18071–18080 (June 2024)
work page 2024
-
[21]
Creativity Research Journal35, 23–32 (2022).https://doi.org/10.1080/10400419.2022.2049532
Cortes, R.A., Colaizzi, G.A., Dyke, E.L., Peterson, E.G., Walker, D.L., Kolvoord, R.A., Uttal, D.H., Green, A.E.: Individual differences in parietal and premotor activity during spatial cognition predict figural creativity. Creativity Research Journal35, 23–32 (2022).https://doi.org/10.1080/10400419.2022.2049532
-
[22]
Comprehensive Physiology8(4), 1575 (2018)
Delhaye, B.P., Long, K.H., Bensmaia, S.J.: Neural basis of touch and propriocep- tion in primate cortex. Comprehensive Physiology8(4), 1575 (2018)
work page 2018
-
[23]
Deng, Q., Deb, O., Patel, A., Rupprecht, C., Torr, P., Trigoni, N., Markham, A.: Towards multi-modal animal pose estimation: An in-depth analysis. CoRR abs/2410.09312(2024)
-
[24]
In: 2020 IEEE International Conference on Robotics and Automation (ICRA)
Deng,X.,Xiang,Y.,Mousavian,A.,Eppner,C.,Bretl,T.,Fox,D.:Self-supervised 6d object pose estimation for robot manipulation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 3665–3671. IEEE (2020)
work page 2020
-
[25]
Cerebral Cortex33(22), 11146–11156 (2023)
Doganci, N., Iannotti, G.R., Coll, S.Y., Ptak, R.: How embodied is cognition? fmri and behavioral evidence for common neural resources underlying motor plan- ning and mental rotation of bodily stimuli. Cerebral Cortex33(22), 11146–11156 (2023)
work page 2023
-
[26]
Du, G., Wang, K., Lian, S., Zhao, K.: Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review. Artif. Intell. Rev.54(3), 1677–1734 (Mar 2021)
work page 2021
-
[27]
Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y
Du, M., Wu, B., Li, Z., Huang, X., Wei, Z.: EmbSpatial-bench: Benchmark- ing spatial understanding for embodied tasks with large vision-language mod- els. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 346–355. Association for Computationa...
-
[28]
In: Proceedings of the European conference on computer vision (ECCV)
Fabbri,M.,Lanzi,F.,Calderara,S.,Palazzi,A.,Vezzani,R.,Cucchiara,R.:Learn- ing to detect and track visible and occluded body joints in a virtual world. In: Proceedings of the European conference on computer vision (ECCV). pp. 430–446 (2018)
work page 2018
-
[29]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Chatpose: Chat- ting about 3d human pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2093–2103 (June 2024)
work page 2093
-
[30]
Trends in cognitive sciences18(10), 536–542 (2014)
Frick, A., Möhring, W., Newcombe, N.S.: Development of mental transformation abilities. Trends in cognitive sciences18(10), 536–542 (2014)
work page 2014
-
[31]
Fu, H., Cohen-Or, D., Dror, G., Sheffer, A.: Upright orientation of man-made objects. In: ACM SIGGRAPH 2008 Papers. SIGGRAPH ’08, Association for 62 Computing Machinery, New York, NY, USA (2008).https://doi.org/10.1145/ 1399504.1360641,https://doi.org/10.1145/1399504.1360641
-
[32]
International Journal of Computer Vision129, 3313–3337 (2021)
Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3d-future: 3d furniture shape with texture. International Journal of Computer Vision129, 3313–3337 (2021)
work page 2021
-
[34]
In: European Conference on Computer Vision
Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
work page 2024
-
[35]
Ganesan, M., Kandhasamy, S., Chokkalingam, B., Mihet-Popa, L.: A comprehen- sive review on deep learning-based motion planning and end-to-end learning for self-driving vehicle. IEEE Access (2024)
work page 2024
-
[36]
Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., Sadigh, D.:Physicallygroundedvision-languagemodels for roboticmanipulation.In:2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 12462– 12469 (2024).https://doi.org/10.1109/ICRA57147.2024.10610090
-
[37]
Advances In Neural Information Processing Systems35, 31841– 31854 (2022)
Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems35, 31841– 31854 (2022)
work page 2022
-
[38]
In: 2012 IEEE conference on computer vision and pattern recognition
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3354–3361. IEEE (2012)
work page 2012
-
[39]
Human brain mapping33(4), 895–908 (2012)
Goble, D.J., Coxon, J.P., Van Impe, A., Geurts, M., Van Hecke, W., Sunaert, S., Wenderoth, N., Swinnen, S.P.: The neural basis of central proprioceptive process- ing in older versus younger adults: an important sensory role for right putamen. Human brain mapping33(4), 895–908 (2012)
work page 2012
-
[40]
Nature Neu- roscience10, 512–522 (2007).https://doi.org/10.1038/nn1865
Golarai, G., Ghahremani, D.G., Whitfield-Gabrieli, S., Reiss, A.L., Eberhardt, J.L., Gabrieli, J.D.E., Grill-Spector, K.: Differential development of high-level visual cortex correlates with category-specific recognition memory. Nature Neu- roscience10, 512–522 (2007).https://doi.org/10.1038/nn1865
-
[41]
In: International Conference on Soft Computing and Pattern Recognition
Gontier, C., Jordan, J., Petrovici, M.A.: Delaunay: A dataset of abstract art for psychophysical and machine learning research. In: International Conference on Soft Computing and Pattern Recognition. pp. 134–143. Springer (2023)
work page 2023
-
[42]
Journal of cognitive neuroscience22(12), 2836–2849 (2010)
Gramann, K., Onton, J., Riccobon, D., Mueller, H.J., Bardins, S., Makeig, S.: Human brain dynamics accompanying use of egocentric and allocentric reference frames during navigation. Journal of cognitive neuroscience22(12), 2836–2849 (2010)
work page 2010
-
[43]
In: 2024 IEEE International Conference on Robotics and Automation (ICRA) (2024)
Guan, T., Yang, Y., Cheng, H., Lin, M., Kim, R., Madhivanan, R., Sen, A., Manocha, D.: LOC-ZSON: language-driven object-centric zero-shot object re- trieval and navigation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA) (2024)
work page 2024
-
[44]
Guibas, L.J.: Collaborative Research: CI-P: ShapeNet: An Information-Rich 3D Model Repository for Graphics, Vision and Robotics Research. NSF Award Num- ber 1729205. Directorate for Computer and Information Science and Engineering, Division Of Computer and Network Systems. 2017. (Sep 2017) 63
work page 2017
-
[45]
In: Proceedings of the 33rd annual con- ference of the cognitive science society
Hamrick, J., Battaglia, P., Tenenbaum, J.B.: Internal physics models guide proba- bilistic judgments about object dynamics. In: Proceedings of the 33rd annual con- ference of the cognitive science society. vol. 2. Cognitive Science Society Austin, TX (2011)
work page 2011
-
[46]
Psychonomic Bulletin & Review31(4), 1503–1515 (2024)
Harris, I.M.: Interpreting the orientation of objects: A cross-disciplinary review. Psychonomic Bulletin & Review31(4), 1503–1515 (2024)
work page 2024
-
[47]
arXiv preprint arXiv:2412.05127 (2024)
Hewing, M., Leinhos, V.: The prompt canvas: a literature-based practitioner guide for creating effective prompts in large language models. arXiv preprint arXiv:2412.05127 (2024)
-
[48]
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)
work page 2022
-
[49]
In: Ku, L.W., Martins, A., Srikumar, V
Huang, W., Liu, H., Guo, M., Gong, N.: Visual hallucinations of multi-modal large language models. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 9614–9631. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024).https://doi. org/10.18653/v1/2024.findings-acl.573,http...
-
[50]
Foundations and trends®in computer graphics and vision12(1–3), 1–308 (2020)
Janai, J., Güney, F., Behl, A., Geiger, A., et al.: Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and trends®in computer graphics and vision12(1–3), 1–308 (2020)
work page 2020
-
[51]
Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: Omnis- patial: Towards comprehensive spatial reasoning benchmark for vision language models.In:TheFourteenthInternationalConferenceonLearningRepresentations (2026),https://openreview.net/forum?id=6nZKT2rL0H
work page 2026
-
[52]
Transactions on Machine Learning Research2024(2024),https://openreview.net/forum?id=skLtdUVaJa
Jiang, D., He, X., Zeng, H., Wei, C., Ku, M.W., Liu, Q., Chen, W.: Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research2024(2024),https://openreview.net/forum?id=skLtdUVaJa
work page 2024
-
[53]
In: 2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)
Jung, J.H., Kim, E.T., Kim, S., Lee, J.H., Kim, B., Chang, B.: Is ‘right’right? enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. In: 2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 14257–14267. IEEE (2025)
work page 2025
-
[54]
Journal of Vi- sion23(10), 9–9 (2023)
Kallmayer, A., Võ, M.L.H., Draschkow, D.: Viewpoint dependence and scene con- text effects generalize to depth rotated three-dimensional objects. Journal of Vi- sion23(10), 9–9 (2023)
work page 2023
-
[55]
Neurosci5, 713–728 (2024).https://doi.org/10.3390/neurosci5040050
Kostakos, K., Pliakopanou, A., Meimaridis, V., Galanou, O., Anagnostou, A., Ser- tidou, D., Katis, P., Anastasiou, P., Katsoulidis, K., Lykogiorgos, Y., Mytilinaios, D., Katsenos, A.P., Simos, Y.V., Bellos, S., Konitsiotis, S., Peschos, D., Tsamis, K.I.: Development of spatial memory: A behavioral study. Neurosci5, 713–728 (2024).https://doi.org/10.3390/n...
-
[56]
Visual Cognition9(1-2), 248–264 (2002).https://doi.org/ 10.1080/13506280143000421
Kourtzi, Z., Nakayama, K.: Distinct mechanisms for the representation of moving and static objects. Visual Cognition9(1-2), 248–264 (2002).https://doi.org/ 10.1080/13506280143000421
-
[57]
In: European Conference on Computer Vision
Krishnan, A., Kundu, A., Maninis, K.K., Hays, J., Brown, M.: Omninocs: A uni- fied nocs dataset and model for 3d lifting of 2d objects. In: European Conference on Computer Vision. pp. 127–145. Springer (2024)
work page 2024
-
[58]
In: Rambow, O., Wan- ner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S
Lei, X., Yang, Z., Chen, X., Li, P., Liu, Y.: Scaffolding coordinates to promote vision-language coordination in large multi-modal models. In: Rambow, O., Wan- ner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Pro- ceedings of the 31st International Conference on Computational Linguistics. pp. 64 2886–2903. Association for Comp...
work page 2025
-
[59]
Li, B., Ge, Y., Chen, Y., Ge, Y., Zhang, R., Shan, Y.: Seed-bench-2-plus: Bench- marking multimodal large language models with text-rich visual comprehension. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)
work page 2024
-
[60]
In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Li, F., Hogg, D.C., Cohn, A.G.: Reframing spatial reasoning evaluation in lan- guage models: a real-world simulation benchmark for qualitative reasoning. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. IJCAI ’24 (2024).https://doi.org/10.24963/ijcai.2024/701, https://doi.org/10.24963/ijcai.2024/701
-
[61]
Scientific reports9(1), 13727 (2019)
Li, H., Liu, N., Li, Y., Weidner, R., Fink, G.R., Chen, Q.: The simon effect based on allocentric and egocentric reference frame: Common and specific neural correlates. Scientific reports9(1), 13727 (2019)
work page 2019
-
[62]
In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T
Li, S., Koh, P.W., Du, S.S.: Exploring how generative MLLMs perceive more than CLIP with the same vision encoder. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers). pp. 10101– 10119. Association for Computational Linguistics, Vienna...
-
[63]
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
work page 2014
-
[64]
Lin, X., Dai, Z., Verma, A., Ng, S.K., Jaillet, P., Low, B.K.H.: Prompt op- timization with human feedback (2025),https://openreview.net/forum?id= UW0zetsx8X
work page 2025
-
[65]
In: ECCV(62).pp.36–55(2024),https://doi.org/10.1007/978-3-031-73033-7_3
Lin,Z.,Liu,D.,Zhang,R.,Gao,P.,Qiu,L.,Xiao,H.,Qiu,H.,Shao,W.,Chen,K., Han,J.,Huang,S.,Zhang,Y.,He,X.,Qiao,Y.,Li,H.:Sphinx:Amixerofweights, visual embeddings and image scales for multi-modal large language models. In: ECCV(62).pp.36–55(2024),https://doi.org/10.1007/978-3-031-73033-7_3
-
[66]
Current Biology19(17), 1458–1462 (2009)
Ling, S., Pearson, J., Blake, R.: Dissociation of neural mechanisms underlying orientation processing in humans. Current Biology19(17), 1458–1462 (2009)
work page 2009
-
[67]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Liu,H.,Li,C.,Li,Y.,Lee,Y.J.:Improvedbaselineswithvisualinstructiontuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26296–26306 (June 2024)
work page 2024
-
[68]
io/blog/2024-01-30-llava-next/
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https://llava-vl.github. io/blog/2024-01-30-llava-next/
work page 2024
-
[69]
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)
work page 2023
-
[70]
Multimedia Tools and Applications83(3), 6909–6924 (2024)
Liu, R., Yu, Z., Fan, Q., Sun, Q., Jiang, Z.: The improved method in fabric image classification using convolutional neural network. Multimedia Tools and Applications83(3), 6909–6924 (2024)
work page 2024
-
[71]
(eds.) Computer Vision – ECCV 2024
Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., Lin, D.: Mmbench: Is your multi-modal model an all- around player? In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 216–233. Springer Nature Switzerland, Cham (2025) 65
work page 2024
-
[72]
Journal of Cognition6(2023).https://doi.org/ 10.5334/joc.321
Loy, J.E., Demberg, V.: Individual differences in spatial orientation modulate perspective taking in listeners. Journal of Cognition6(2023).https://doi.org/ 10.5334/joc.321
-
[73]
In: International Conference on Computer Vision (ICCV) (2025)
Ma, W., Chen, H., Zhang, G., Chou, Y.C., de Melo, C.M., Yuille, A.: 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In: International Conference on Computer Vision (ICCV) (2025)
work page 2025
-
[74]
The MIT Press, Cambridge, 1st ed
Mallot, H.A.: From geometry to behavior : an introduction to spatial cognition. The MIT Press, Cambridge, 1st ed. edn. (2023)
work page 2023
-
[75]
Advances in neural information processing systems25(2012)
Mezuman, E., Weiss, Y.: Learning about canonical views from internet image collections. Advances in neural information processing systems25(2012)
work page 2012
-
[76]
Mirzaee, R., Rajaby Faghihi, H., Ning, Q., Kordjamshidi, P.: SPARTQA: A textual question answering benchmark for spatial reasoning. In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cot- terell, R., Chakraborty, T., Zhou, Y. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Associati...
-
[77]
In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems
Mu, Y., Zhang, Q., Hu, M., Wang, W., Ding, M., Jin, J., Wang, B., Dai, J., Qiao, Y., Luo, P.: Embodiedgpt: vision-language pre-training via embodied chain of thought. In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)
work page 2023
-
[78]
Experimental Psychology68, 41–48 (2021).https://doi.org/10.1027/1618-3169/a000505
Muto, H.: Correlational evidence for the role of spatial perspective-taking ability in the mental rotation of human-like objects. Experimental Psychology68, 41–48 (2021).https://doi.org/10.1027/1618-3169/a000505
-
[79]
neelgajare: Rock images (2022),https : / / www . kaggle . com / datasets / neelgajare/rocks-dataset[Accessed: (01-29-2025)]
work page 2022
-
[80]
Deep learning for early warning signals of tipping points
Peer, M., Salomon, R., Goldberg, I., Blanke, O., Arzy, S.: Brain system for men- tal orientation in space, time, and person. Proceedings of the National Academy of Sciences112(35), 11072–11077 (2015).https://doi.org/10.1073/pnas. 1504242112,https://www.pnas.org/doi/abs/10.1073/pnas.1504242112
-
[81]
Pei, J., Viola, I., Huang, H., Wang, J., Ahsan, M., Ye, F., Jiang, Y., Sai, Y., Wang, D., Chen, Z., Ren, P., César, P.: Autonomous workflow for multimodal fine- grained training assistants towards mixed reality. In: Annual Meeting of the As- sociation for Computational Linguistics (2024),https://api.semanticscholar. org/CorpusID:269982847
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.