Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3
The pith
Multimodal large language models can reason about space from video but remain below human level: spatial reasoning is the main limit, and explicitly generated cognitive maps help on distance tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLLMs exhibit competitive though subhuman visual-spatial intelligence on VSI-Bench; spatial reasoning remains the primary bottleneck, local world models and spatial awareness emerge, linguistic reasoning techniques fail to improve performance, and explicitly generating cognitive maps enhances spatial distance ability.
What carries the argument
VSI-Bench, a video-based visual-spatial intelligence benchmark of over 5,000 question-answer pairs, together with explicit cognitive map generation as a reasoning step that improves distance judgments.
If this is right
- Spatial reasoning, not perception or memory, sets the main limit on MLLM performance for space-related tasks.
- Local world models and spatial awareness form inside MLLMs from video training alone.
- Chain-of-thought, self-consistency, and tree-of-thoughts produce no gains on these spatial questions.
- Forcing explicit cognitive map output specifically boosts accuracy on spatial distance items.
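As a rough illustration of why an explicit map could help on distance items, the sketch below assumes a model has already emitted a cognitive map as object names with grid coordinates, and then answers a relative-distance question by comparing Euclidean distances on that grid. The 10x10 grid, the object names, and the helpers `grid_distance` and `closest_object` are hypothetical choices made for this example; this summary does not specify the paper's actual map format or prompting procedure.

```python
import math

# Hypothetical cognitive map: object name -> (x, y) cell on an assumed 10x10 grid.
# In practice these positions would be parsed from the model's own map output.
cognitive_map = {
    "sofa": (2, 3),
    "table": (5, 3),
    "lamp": (9, 8),
    "tv": (2, 9),
}


def grid_distance(a, b):
    """Euclidean distance between two grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def closest_object(cmap, target, candidates):
    """Return the candidate whose map position is nearest to the target object."""
    return min(candidates, key=lambda name: grid_distance(cmap[name], cmap[target]))


# Relative-distance question: which of these objects is closest to the sofa?
answer = closest_object(cognitive_map, "sofa", ["table", "lamp", "tv"])
print(answer)  # prints: table
```

Once positions are written down, the distance comparison becomes simple arithmetic, which is one plausible reading of why map generation helps distance questions in particular.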
Where Pith is reading between the lines
- Future model designs could add dedicated modules for maintaining and updating spatial maps to close the remaining gap with humans.
- The same video training that produces these abilities might already support downstream uses such as navigation planning or scene reconstruction.
- New tests could check whether the observed spatial awareness transfers to physical robots or to environments never seen during pretraining.
Load-bearing premise
The VSI-Bench questions and evaluation protocol accurately measure human-like visual-spatial intelligence rather than surface-level pattern matching or dataset artifacts.
What would settle it
Human participants scoring only marginally above the best MLLMs on the full VSI-Bench, or MLLMs maintaining high scores on distance questions even when prevented from generating cognitive maps.
Original abstract
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VSI-Bench, a new video-based benchmark with over 5,000 QA pairs, to evaluate visual-spatial intelligence in MLLMs. It reports that current MLLMs achieve competitive but subhuman performance on the benchmark, identifies spatial reasoning as the primary bottleneck, shows that linguistic techniques such as chain-of-thought fail to help, and finds that explicitly generating cognitive maps during inference improves performance on spatial distance tasks.
Significance. If the benchmark accurately isolates sequential spatial reasoning, the work supplies a useful new evaluation resource and empirical evidence that explicit cognitive-map generation can mitigate a key limitation in MLLMs. The new data collection and the contrast between failed linguistic interventions and successful map generation are concrete strengths that could inform future model design for better world modeling.
major comments (3)
- [§3] §3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.
- [§5] §5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.
- [§6] §6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.
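The paired significance testing requested in the §5 comment can be sketched for binary-scored items as follows. This is a minimal example using an exact McNemar test, which is one possible non-parametric choice and not a test the paper or the rebuttal commits to. The function name `mcnemar_exact` and the per-question correctness flags are made up for illustration. Items scored with a graded numeric metric would need a different paired test.

```python
from math import comb


def mcnemar_exact(baseline_correct, variant_correct):
    """Two-sided exact McNemar test on paired per-question correctness flags."""
    if len(baseline_correct) != len(variant_correct):
        raise ValueError("both runs must cover the same questions")
    # Discordant pairs: questions where exactly one of the two runs is correct.
    b = sum(1 for x, y in zip(baseline_correct, variant_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, variant_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the runs never disagree, so there is no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair favors either run with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative flags only (1 = correct), not results from the paper.
baseline = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
with_map = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
p_value = mcnemar_exact(baseline, with_map)
print(p_value)  # prints: 0.03125
```

Pairing by question matters here because both conditions are evaluated on the same items, so only the questions where the two runs disagree carry information about the difference.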
minor comments (2)
- [Figures] Figure captions and axis labels in several result plots omit units or confidence intervals, reducing readability.
- [Related Work] A small number of recent works on spatial reasoning benchmarks for vision-language models are missing from the related-work section.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will incorporate.
-
Referee: [§3] §3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.
Authors: We appreciate the referee's emphasis on rigorous benchmark validation. The original manuscript outlines the data collection pipeline in Section 3 but does not provide sufficient specifics on validation steps. In the revised manuscript, we will expand this section to report: (i) the multi-stage question validation process involving independent annotators, (ii) inter-annotator agreement statistics such as Cohen's kappa, and (iii) explicit filtering and control measures (e.g., adversarial question design and statistical checks) intended to prevent solutions based on single-frame cues, temporal co-occurrences, or linguistic shortcuts. These additions will more clearly establish that successful performance requires sequential spatial reconstruction.
revision: yes
-
Referee: [§5] §5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.
Authors: We agree that stronger statistical support and controlled comparisons would improve the results section. In the revision we will: (1) include statistical significance tests (paired t-tests or appropriate non-parametric equivalents) for key performance differences, (2) add or clarify ablations that hold video length and model scale fixed where feasible, and (3) provide additional details on the human baseline collection protocol, including the number of participants, instructions given, and how conditions were matched to model evaluations. These changes will allow readers to better assess the reliability of the reported gaps and the benefit of cognitive-map generation.
revision: yes
-
Referee: [§6] §6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.
Authors: We acknowledge that the current analysis would benefit from more direct evidence that surface cues are insufficient. We will add a dedicated subsection presenting diagnostic experiments (e.g., performance on variants where spatial relations are perturbed while preserving low-level statistics) together with a systematic error analysis that categorizes model failures according to whether they stem from spatial mis-reasoning versus other factors. This will provide empirical support for identifying spatial reasoning as the primary bottleneck.
revision: yes
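The inter-annotator agreement statistic named in the first response, Cohen's kappa, can be computed as in this minimal sketch. The function name `cohens_kappa`, the "keep"/"drop" labels, and the two annotators' judgments are invented for illustration. They are not data from the paper, and the rebuttal does not describe the actual annotation scheme.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical validation pass: should each candidate question be kept?
annotator_1 = ["keep", "keep", "drop", "keep", "drop", "keep", "keep", "drop"]
annotator_2 = ["keep", "keep", "drop", "drop", "drop", "keep", "keep", "keep"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # prints: 0.467
```

In this toy case the annotators agree on 6 of 8 items (0.75), but chance agreement is about 0.53, so kappa is well below the raw agreement rate. That gap is the reason to report kappa rather than raw agreement.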
Circularity Check
Empirical benchmark study with no load-bearing circularity
The paper introduces VSI-Bench (a new video-based visual-spatial intelligence benchmark with over 5,000 QA pairs) and reports empirical evaluations of MLLMs on it, including probes for linguistic vs. visual thinking and the effect of explicit cognitive map generation. No equations, fitted parameters, or derivation chains exist that reduce the reported performance numbers or intervention benefits to prior self-citations by construction. Central claims rest on fresh data collection and controlled experiments rather than self-referential definitions or uniqueness theorems imported from the authors' prior work. Any self-citations present are incidental and non-load-bearing for the benchmark results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The constructed VSI-Bench questions validly measure visual-spatial intelligence comparable to human performance.
Lean theorems connected to this paper
-
Foundation.DimensionForcing · dimension_forced · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs... MLLMs exhibit competitive - though subhuman - visual-spatial intelligence... explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves ...
-
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
-
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
SaaS-Bench provides 106 realistic professional tasks across 23 deployable SaaS platforms to evaluate LLM-based agents, finding that even the strongest models complete fewer than 4% of tasks end-to-end.
-
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
-
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...
-
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
-
Grounded Reinforcement Learning for Visual Reasoning
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
-
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
-
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
-
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 2
work page 2022
- [2]
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 2, 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS,
-
[6]
Large linguistic models: Analyzing theoretical linguistic abilities of llms
Gašper Beguš, Maksymilian D ˛ abkowski, and Ryan Rhodes. Large linguistic models: Analyzing theoretical linguistic abilities of llms. arXiv preprint arXiv:2305.00948 , 2023. 2
-
[7]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023. 2
work page 2023
-
[8]
Rt-1: Robotics transformer for real-world con- trol at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale. In RSS, 2023. 2
work page 2023
-
[9]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...
work page 2020
-
[10]
Spa- tialbot: Precise spatial understanding with vision language models
Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xi- aoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spa- tialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024. 8
-
[11]
Spatial and object visualization cognitive styles: Validation studies in 3800 individuals
Christopher F Chabris, Thomas E Jerde, Anita W Woolley, Margaret E Gerbasi, Jonathon P Schuldt, Sean L Bennett, J Richard Hackman, and Stephen M Kosslyn. Spatial and object visualization cognitive styles: Validation studies in 3800 individuals. Group brain technical report , 2:1–20,
-
[12]
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In NeurIPS, 2024. 2, 17
work page 2024
-
[13]
Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities. In CVPR, 2024. 8
work page 2024
-
[14]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 2
work page 2024
-
[16]
Spatialrgpt: Grounded spatial reasoning in vision-language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024. 8
work page 2024
-
[17]
Spatially-aware transformers for embodied agents
Junmo Cho, Jaesik Yoon, and Sungjin Ahn. Spatially-aware transformers for embodied agents. In ICLR, 2023. 8
work page 2023
-
[18]
James M. Clark and Allan Paivio. Dual coding theory and education. Educational Psychology Review, 3(3):149–210,
-
[19]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 3, 13
work page 2017
-
[20]
Milton J. Dehn. Working Memory and Academic Learning: Assessment and Intervention. John Wiley & Sons, 2011. 3
work page 2011
-
[21]
Palm-e: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 2, 8
work page 2023
-
[22]
The pascal visual object classes (voc) challenge
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010. 4
work page 2010
-
[23]
Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. In NeurIPS, 2024. 8
work page 2024
-
[24]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 4, 7, 8, 16
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Self- explanation prompting improves dialogue understanding in large language models
Haoyu Gao, Ting-En Lin, Hangyu Li, Min Yang, Yuchuan Wu, Wentao Ma, Fei Huang, and Yongbin Li. Self- explanation prompting improves dialogue understanding in large language models. In COLING, 2024. 5
work page 2024
-
[26]
Frames of Mind: The Theory of Multi- ple Intelligences
Howard Gardner. Frames of Mind: The Theory of Multi- ple Intelligences. Basic Books, tenth-anniversary edition, second paperback edition edition, 1983. 2
work page 1983
-
[27]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 2
work page 2022
-
[28]
A real-world webagent with planning, long context under- standing, and program synthesis
Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Saf- dari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context under- standing, and program synthesis. In ICLR, 2024. 2
work page 2024
-
[29]
Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. arXiv preprint arXiv:2411.14794, 2024. 8
-
[30]
Masked autoencoders are scal- able vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scal- able vision learners. In CVPR, 2022. 8
work page 2022
-
[31]
Mea- suring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In ICLR,
-
[32]
Can large language models explain themselves? a study of llm-generated self- explanations
Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. Can large language models explain themselves? a study of llm-generated self- explanations. arXiv preprint arXiv:2310.11207, 2023. 5
-
[33]
Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents. In ICML, 2022. 2, 6
work page 2022
-
[34]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 4, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024. 2, 6
work page 2024
-
[36]
Language models with rationality
Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze, and Peter Clark. Language models with rationality. In EMNLP, 2023. 2
work page 2023
-
[37]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In CoRL,
-
[38]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 7, 15
work page 2022
-
[39]
Seed-bench: Bench- marking multimodal large language models
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. In CVPR,
-
[40]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 4, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Topviewrs: Vision-language models as top-view spatial reasoners
Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vuli ´c. Topviewrs: Vision-language models as top-view spatial reasoners. arXiv preprint arXiv:2406.02537, 2024. 8
-
[42]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 2
work page 2023
-
[43]
Mvbench: A comprehensive multi-modal video under- standing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. In CVPR, 2024. 8
work page 2024
-
[44]
Vitatecs: A diag- nostic dataset for temporal concept understanding of video- language models
Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diag- nostic dataset for temporal concept understanding of video- language models. arXiv preprint arXiv:2311.17404, 2023. 8
-
[45]
Vila: On pre-training for vi- sual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. In CVPR, 2024. 4
work page 2024
-
[46]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4
work page 2014
-
[47]
Multi- modal situated reasoning in 3d scenes
Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiao- jian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi- modal situated reasoning in 3d scenes. Advances in Neu- ral Information Processing Systems , 37:140903–140936,
-
[48]
Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model
Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yan- song Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model. arXiv preprint arXiv:2408.00754,
-
[49]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 2, 17
work page 2024
-
[50]
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Temp- Compass: Do video LLMs really understand videos? In Findings of ACL, 2024
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Temp- Compass: Do video LLMs really understand videos? In Findings of ACL, 2024. 8
work page 2024
-
[52]
Mmbench: Is your multi- modal model an all-around player? In ECCV, 2025
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi- modal model an all-around player? In ECCV, 2025. 8
work page 2025
-
[53]
Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In ACL, 2023. 5
-
[54]
Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In CVPR, 2024. 8, 17
-
[55]
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS, 2023. 2, 8, 16
-
[56]
Julia McAfoose and Bernhard T. Baune. Exploring visual–spatial working memory: A critical review of concepts and models. Neuropsychology Review, 2009. 3
-
[57]
Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Tommaso Feraco Miola. Individual differences in navigation: an introductory overview. Prime archives in psychology, 2022. 2
-
[58]
Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson. Evaluating cognitive maps and planning in large language models with cogeval. NeurIPS,
-
[59]
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. NeurIPS, 2024. 8
-
[60]
Lynn Nadel. The Hippocampus and Context Revisited. Oxford University Press, 2008. 7
-
[61]
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023. 2
- [62]
-
[63]
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 8
-
[64]
Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. 2, 8
-
[65]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...
work page 2024
-
[66]
Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In ACL, 2024. 5
-
[67]
Alec Radford. Improving language understanding by generative pre-training. OpenAI Blog, 2018. 2, 8
-
[68]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,
-
[69]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8
-
[70]
Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? arXiv preprint arXiv:2410.06468, 2024. 8
-
[71]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016. 5
-
[72]
Julia Rozanova, Deborah Ferreira, Krishna Dubba, Weiwei Cheng, Dell Zhang, and Andre Freitas. Grounding natural language instructions: Can large language models capture spatial information? arXiv preprint arXiv:2109.08634,
-
[73]
Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., USA, 1986. 4
-
[74]
Shenna Shepard and Douglas Metzler. Mental rotation: effects of dimensionality of objects and type of task. Journal of experimental psychology: Human perception and performance, 14(1):3, 1988. 2
-
[75]
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In ICCV, 2023. 6
-
[76]
Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, and Jinhua Zhao. Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning. arXiv preprint arXiv:2410.16162, 2024. 8
-
[77]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 8
-
[78]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2, 4, 5, 8, 17
-
[79]
Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024. 2
-
[80]
E. C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. 2, 7