Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Anjali W. Gupta; Jihan Yang; Li Fei-Fei; Rilyn Han; Saining Xie; Shusheng Yang

arxiv: 2412.14171 · v2 · pith:BY5ENMHMnew · submitted 2024-12-18 · 💻 cs.CV

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie

Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language models · visual-spatial intelligence · cognitive maps · spatial reasoning · VSI-Bench · video understanding · world models

The pith

Multimodal large language models show some ability to think in space from video but remain below human level: spatial reasoning is the main limit, and generating cognitive maps helps on distance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether MLLMs trained on large video datasets can remember and reason about physical spaces the way humans do from sequential views. It introduces VSI-Bench, a benchmark of more than 5,000 question-answer pairs, and shows that models reach competitive performance yet fall short of humans. Standard linguistic reasoning techniques fail to raise scores, but requiring the model to generate an explicit cognitive map while answering improves results on distance questions. This indicates that some spatial awareness arises inside these models even without direct supervision for it.

Core claim

MLLMs exhibit competitive though subhuman visual-spatial intelligence on VSI-Bench; spatial reasoning remains the primary bottleneck, local world models and spatial awareness emerge, linguistic reasoning techniques fail to improve performance, and explicitly generating cognitive maps enhances spatial distance ability.

What carries the argument

VSI-Bench, a video-based visual-spatial intelligence benchmark of over 5,000 question-answer pairs, together with explicit cognitive map generation as a reasoning step that improves distance judgments.
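To make the cognitive-map step concrete, here is a minimal Python sketch of why an explicit map could help with distance questions: once objects have positions, a relative-distance question reduces to arithmetic. The grid coordinates, the object names, and the helper functions are all hypothetical; the summary above does not specify the map format the paper uses.

```python
import math

# Hypothetical cognitive map: object name -> (row, col) cell on a coarse grid.
# The layout and the objects are illustrative, not taken from the paper.
cognitive_map = {
    "sofa": (2, 3),
    "table": (4, 4),
    "lamp": (8, 1),
    "door": (9, 9),
}


def grid_distance(cmap, a, b):
    """Euclidean distance between two objects, measured in grid cells."""
    (r1, c1), (r2, c2) = cmap[a], cmap[b]
    return math.hypot(r1 - r2, c1 - c2)


def closest_object(cmap, target, candidates):
    """Return the candidate whose map position is nearest to the target."""
    return min(candidates, key=lambda name: grid_distance(cmap, target, name))


# "Which of table, lamp, door is closest to the sofa?"
print(closest_object(cognitive_map, "sofa", ["table", "lamp", "door"]))
```

In this sketch the hard part, placing objects on the map from video, is assumed to be done already; the code only shows the step that becomes easy once a map exists.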

If this is right

Spatial reasoning, not perception or memory, sets the main limit on MLLM performance for space-related tasks.
Local world models and spatial awareness form inside MLLMs from video training alone.
Chain-of-thought, self-consistency, and tree-of-thoughts produce no gains on these spatial questions.
Forcing explicit cognitive map output specifically boosts accuracy on spatial distance items.
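As an illustration of one of the linguistic techniques named above, self-consistency is commonly described as sampling several answers and returning the majority vote. The sketch below shows only the voting step; the sampled answers are hypothetical, and this is not the paper's evaluation code.

```python
from collections import Counter


def self_consistency(sampled_answers):
    """Return the most frequent answer among several sampled answers.

    Ties are broken in favor of the answer that appeared first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][0]


# Hypothetical answers sampled from one model for one multiple-choice question.
samples = ["B", "A", "B", "C", "B"]
print(self_consistency(samples))
```

Voting can only aggregate what the samples already contain. If every sample rests on the same mistaken spatial layout, the vote returns that mistake, which is one possible reading of why such techniques produce no gains here.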

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model designs could add dedicated modules for maintaining and updating spatial maps to close the remaining gap with humans.
The same video training that produces these abilities might already support downstream uses such as navigation planning or scene reconstruction.
New tests could check whether the observed spatial awareness transfers to physical robots or to environments never seen during pretraining.

Load-bearing premise

The VSI-Bench questions and evaluation protocol accurately measure human-like visual-spatial intelligence rather than surface-level pattern matching or dataset artifacts.
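One simple way to probe this premise is a "blind" baseline that never sees the video and always guesses the most frequent answer for each question type. The sketch below is a hypothetical check, not something the summary attributes to the paper; the question types and answers are toy data.

```python
from collections import Counter, defaultdict


def blind_baseline_accuracy(items):
    """Accuracy of always guessing the most frequent answer per question type.

    `items` is a list of (question_type, gold_answer) pairs. No video is used.
    The majority answer is chosen on the same items it is scored on, so the
    result is an optimistic bound for a frequency-only guesser.
    """
    if not items:
        raise ValueError("need at least one item")
    by_type = defaultdict(list)
    for qtype, gold in items:
        by_type[qtype].append(gold)

    correct = 0
    for golds in by_type.values():
        # The majority answer is right exactly as often as it occurs.
        correct += Counter(golds).most_common(1)[0][1]
    return correct / len(items)


# Hypothetical toy data: (question type, gold answer).
toy_items = [
    ("relative_distance", "A"),
    ("relative_distance", "A"),
    ("relative_distance", "B"),
    ("object_count", "2"),
    ("object_count", "2"),
    ("object_count", "3"),
]
print(blind_baseline_accuracy(toy_items))
```

If a baseline of this kind came close to model accuracy on the real benchmark, the scores would say more about answer distributions than about spatial intelligence.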

What would settle it

Human participants scoring only marginally above the best MLLMs on the full VSI-Bench, or MLLMs maintaining high scores on distance questions even when prevented from generating cognitive maps.

Original abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VSI-Bench gives a useful first look at video spatial reasoning in MLLMs, but the questions may still allow non-spatial shortcuts.

The main things to know are that this paper releases VSI-Bench with over 5,000 video-based questions and reports that MLLMs reach competitive but subhuman scores, with explicit cognitive-map generation lifting distance performance while chain-of-thought and similar tricks do not. The work also claims that local world models and spatial awareness appear inside the models anyway. That combination of a new benchmark plus a simple intervention that moves one metric is the concrete contribution here. The authors run a range of models, compare against human baselines, and show clear gaps on spatial tasks, which is straightforward empirical work that fills a gap in current video evaluation suites. The map-generation result is the part that stands out as actionable rather than just another leaderboard entry.

The soft spots sit mostly in the benchmark design itself. The stress-test concern lands: if many items can be solved by counting objects across frames, using temporal co-occurrence from pretraining, or leaning on linguistic priors, then the performance gap and the map benefit both become harder to interpret as evidence of genuine spatial memory. The abstract gives no numbers on question validation, inter-annotator agreement, or controls for video length and model scale, so the central claim that spatial reasoning is the primary bottleneck rests on an assumption that needs tighter checks in the full paper. Minor issues include the usual lack of statistical tests reported in the summary, but those are fixable.

This paper is for groups building or evaluating multimodal models for robotics and navigation. Anyone who needs a video spatial benchmark to test against will find the dataset and the map-generation baseline worth looking at. It is coherent on its own terms and shows honest engagement with the evaluation problem, so it deserves a serious referee rather than a desk reject. I would send it for review and ask specifically for more detail on how the questions were constructed to rule out surface cues.

Referee Report

3 major / 2 minor

Summary. The paper introduces VSI-Bench, a new video-based benchmark with over 5,000 QA pairs, to evaluate visual-spatial intelligence in MLLMs. It reports that current MLLMs achieve competitive but subhuman performance on the benchmark, identifies spatial reasoning as the primary bottleneck, shows that linguistic techniques such as chain-of-thought fail to help, and finds that explicitly generating cognitive maps during inference improves performance on spatial distance tasks.

Significance. If the benchmark accurately isolates sequential spatial reasoning, the work supplies a useful new evaluation resource and empirical evidence that explicit cognitive-map generation can mitigate a key limitation in MLLMs. The new data collection and the contrast between failed linguistic interventions and successful map generation are concrete strengths that could inform future model design for better world modeling.

major comments (3)

§3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.
§5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.
§6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.
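The kind of significance test the §5 comment asks for can be sketched with a paired sign-flip permutation test, which compares two conditions (for example, with and without cognitive-map generation) on the same questions. This is one reasonable choice among several, not a method the paper is reported to use; the function name, the resample count, and the toy scores are assumptions for illustration.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item scores.

    Returns an estimated p-value for the null hypothesis that the two
    conditions have the same mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-item correctness (1 = correct) for the same ten questions.
with_map = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
without_map = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(paired_permutation_test(with_map, without_map))
```

Pairing matters because both conditions answer the same questions: the test looks only at items where the two conditions disagree, so item difficulty shared by both conditions cancels out.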

minor comments (2)

[Figures] Figure captions and axis labels in several result plots omit units or confidence intervals, reducing readability.
[Related Work] A small number of recent works on spatial reasoning benchmarks for vision-language models are missing from the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and describe the revisions we will incorporate.

Referee: §3 (VSI-Bench construction): the paper provides no details on question validation, inter-annotator agreement, or explicit controls that would rule out solutions via frame-level object statistics, temporal co-occurrence patterns, or linguistic priors rather than genuine sequential spatial reconstruction.

Authors: We appreciate the referee's emphasis on rigorous benchmark validation. The original manuscript outlines the data collection pipeline in Section 3 but does not provide sufficient specifics on validation steps. In the revised manuscript, we will expand this section to report: (i) the multi-stage question validation process involving independent annotators, (ii) inter-annotator agreement statistics such as Cohen's kappa, and (iii) explicit filtering and control measures (e.g., adversarial question design and statistical checks) intended to prevent solutions based on single-frame cues, temporal co-occurrences, or linguistic shortcuts. These additions will more clearly establish that successful performance requires sequential spatial reconstruction.

revision: yes

Referee: §5 (Experiments and results): performance gaps and the reported benefit of cognitive-map generation are presented without statistical significance tests, without ablations that hold video length and model scale constant, and without human baselines collected under matched conditions.

Authors: We agree that stronger statistical support and controlled comparisons would improve the results section. In the revision we will: (1) include statistical significance tests (paired t-tests or appropriate non-parametric equivalents) for key performance differences, (2) add or clarify ablations that hold video length and model scale fixed where feasible, and (3) provide additional details on the human baseline collection protocol, including the number of participants, instructions given, and how conditions were matched to model evaluations. These changes will allow readers to better assess the reliability of the reported gaps and the benefit of cognitive-map generation.

revision: yes

Referee: §6 (Analysis of reasoning strategies): the claim that spatial reasoning remains the primary bottleneck rests on the assumption that VSI-Bench items cannot be solved by surface cues; no diagnostic experiments or error analysis are supplied to support this assumption.

Authors: We acknowledge that the current analysis would benefit from more direct evidence that surface cues are insufficient. We will add a dedicated subsection presenting diagnostic experiments (e.g., performance on variants where spatial relations are perturbed while preserving low-level statistics) together with a systematic error analysis that categorizes model failures according to whether they stem from spatial mis-reasoning versus other factors. This will provide empirical support for identifying spatial reasoning as the primary bottleneck.

revision: yes
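The rebuttal names Cohen's kappa as an inter-annotator agreement statistic. For readers unfamiliar with it, the following Python sketch computes kappa for two annotators who label the same items: observed agreement corrected for the agreement expected by chance. The labels and data are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical validity judgments from two annotators on six questions.
annotator_1 = ["valid", "valid", "valid", "invalid", "valid", "invalid"]
annotator_2 = ["valid", "valid", "invalid", "invalid", "valid", "valid"]
print(cohens_kappa(annotator_1, annotator_2))
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance, which is why it is more informative than raw percent agreement when one label dominates.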

Circularity Check

0 steps flagged

Empirical benchmark study with no load-bearing circularity

The paper introduces VSI-Bench (a new video-based visual-spatial intelligence benchmark with over 5,000 QA pairs) and reports empirical evaluations of MLLMs on it, including probes for linguistic vs. visual thinking and the effect of explicit cognitive map generation. No equations, fitted parameters, or derivation chains exist that reduce the reported performance numbers or intervention benefits to prior self-citations by construction. Central claims rest on fresh data collection and controlled experiments rather than self-referential definitions or uniqueness theorems imported from the authors' prior work. Any self-citations present are incidental and non-load-bearing for the benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on the unverified assumption that the new benchmark faithfully captures visual-spatial intelligence and that model outputs during probing reflect genuine internal spatial representations rather than prompt artifacts.

axioms (1)

domain assumption: The constructed VSI-Bench questions validly measure visual-spatial intelligence comparable to human performance.
This premise is required to interpret subhuman model scores and the cognitive-map improvement as evidence about model capabilities.

pith-pipeline@v0.9.0 · 5706 in / 1170 out tokens · 30397 ms · 2026-05-22T09:23:14.427608+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing dimension_forced unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs... MLLMs exhibit competitive - though subhuman - visual-spatial intelligence... explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
cs.CV 2026-05 unverdicted novelty 7.0

SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves ...

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
cs.CV 2026-05 unverdicted novelty 7.0

The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
cs.AI 2026-05 unverdicted novelty 7.0

SaaS-Bench provides 106 realistic professional tasks across 23 deployable SaaS platforms to evaluate LLM-based agents, finding that even the strongest models complete fewer than 4% of tasks end-to-end.

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
cs.CV 2026-04 unverdicted novelty 7.0

EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
cs.CV 2025-05 conditional novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
cs.CV 2025-04 unverdicted novelty 7.0

SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
cs.CV 2026-05 unverdicted novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
cs.RO 2025-08 conditional novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
cs.CV 2025-06 unverdicted novelty 6.0

VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0

Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0

Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

Grounded Reinforcement Learning for Visual Reasoning
cs.CV 2025-05 unverdicted novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
cs.CV 2025-05 unverdicted novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
cs.CV 2026-04 unverdicted novelty 5.0

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
cs.RO 2025-01 unverdicted novelty 5.0

SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
cs.CV 2026-05 unverdicted novelty 4.0

LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.

EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
cs.CV 2025-06 unverdicted novelty 4.0

Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 23 Pith papers · 20 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 2

work page 2022
[2]

Working memory

Alan Baddeley. Working memory. Science, 255(5044): 556–559, 1992. 2

work page 1992
[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 2, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS,

work page
[6]

Large linguistic models: Analyzing theoretical linguistic abilities of llms

Gašper Beguš, Maksymilian D ˛ abkowski, and Ryan Rhodes. Large linguistic models: Analyzing theoretical linguistic abilities of llms. arXiv preprint arXiv:2305.00948 , 2023. 2

work page arXiv 2023
[7]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023. 2

work page 2023
[8]

Rt-1: Robotics transformer for real-world con- trol at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale. In RSS, 2023. 2

work page 2023
[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

work page 2020
[10]

Spa- tialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xi- aoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spa- tialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024. 8

work page arXiv 2024
[11]

Spatial and object visualization cognitive styles: Validation studies in 3800 individuals

Christopher F Chabris, Thomas E Jerde, Anita W Woolley, Margaret E Gerbasi, Jonathon P Schuldt, Sean L Bennett, J Richard Hackman, and Stephen M Kosslyn. Spatial and object visualization cognitive styles: Validation studies in 3800 individuals. Group brain technical report , 2:1–20,

work page
[12]

Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Fei-Fei Li

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In NeurIPS, 2024. 2, 17

work page 2024
[13]

Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabil- ities. In CVPR, 2024. 8

work page 2024
[14]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 2

work page 2024
[16]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024. 8

work page 2024
[17]

Spatially-aware transformers for embodied agents

Junmo Cho, Jaesik Yoon, and Sungjin Ahn. Spatially-aware transformers for embodied agents. In ICLR, 2023. 8

work page 2023
[18]

Clark and Allan Paivio

James M. Clark and Allan Paivio. Dual coding theory and education. Educational Psychology Review, 3(3):149–210,

work page
[19]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 3, 13

work page 2017
[20]

Milton J. Dehn. Working Memory and Academic Learning: Assessment and Intervention. John Wiley & Sons, 2011. 3

work page 2011
[21]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 2, 8

work page 2023
[22]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010. 4

work page 2010
[23]

Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. In NeurIPS, 2024. 8

work page 2024
[24]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 4, 7, 8, 16

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Self- explanation prompting improves dialogue understanding in large language models

Haoyu Gao, Ting-En Lin, Hangyu Li, Min Yang, Yuchuan Wu, Wentao Ma, Fei Huang, and Yongbin Li. Self- explanation prompting improves dialogue understanding in large language models. In COLING, 2024. 5

work page 2024
[26]

Frames of Mind: The Theory of Multi- ple Intelligences

Howard Gardner. Frames of Mind: The Theory of Multi- ple Intelligences. Basic Books, tenth-anniversary edition, second paperback edition edition, 1983. 2

work page 1983
[27]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 2

work page 2022
[28]

A real-world webagent with planning, long context under- standing, and program synthesis

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Saf- dari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context under- standing, and program synthesis. In ICLR, 2024. 2

work page 2024
[29]

Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. arXiv preprint arXiv:2411.14794, 2024. 8

work page arXiv 2024
[30]

Masked autoencoders are scal- able vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scal- able vision learners. In CVPR, 2022. 8

work page 2022
[31]

Mea- suring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In ICLR,

[32]

Can large language models explain themselves? a study of llm-generated self-explanations

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. Can large language models explain themselves? a study of llm-generated self-explanations. arXiv preprint arXiv:2310.11207, 2023. 5

[33]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022. 2, 6

[34]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 4, 8

[35]

SWE-bench: Can language models resolve real-world github issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024. 2, 6

[36]

Language models with rationality

Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze, and Peter Clark. Language models with rationality. In EMNLP, 2023. 2

[37]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In CoRL,

[38]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 7, 15

[39]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR,

[40]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 4, 8

[41]

Topviewrs: Vision-language models as top-view spatial reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision-language models as top-view spatial reasoners. arXiv preprint arXiv:2406.02537, 2024. 8

[42]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 2

[43]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024. 8

[44]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. arXiv preprint arXiv:2311.17404, 2023. 8

[45]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024. 4

[46]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

[47]

Multi-modal situated reasoning in 3d scenes

Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems, 37:140903–140936,

[48]

Coarse correspondence elicit 3d spacetime understanding in multimodal language model

Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in multimodal language model. arXiv preprint arXiv:2408.00754,

[49]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 2, 17

[50]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 8

[51]

TempCompass: Do video LLMs really understand videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Findings of ACL, 2024. 8

[52]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2025. 8

[53]

Faithful chain-of-thought reasoning

Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In ACL, 2023. 5

[54]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In CVPR, 2024. 8, 17

[55]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. NeurIPS, 2023. 2, 8, 16

[56]

Julia McAfoose and Bernhard T. Baune. Exploring visual–spatial working memory: A critical review of concepts and models. Neuropsychology Review, 2009. 3

[57]

Individual differences in navigation: an introductory overview

Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veronica Muffato, and Tommaso Feraco Miola. Individual differences in navigation: an introductory overview. Prime archives in psychology, 2022. 2

[58]

Evaluating cognitive maps and planning in large language models with cogeval

Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson. Evaluating cognitive maps and planning in large language models with cogeval. NeurIPS,

[59]

Embodiedgpt: Vision-language pre-training via embodied chain of thought

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. NeurIPS, 2024. 8

[60]

The Hippocampus and Context Revisited

Lynn Nadel. The Hippocampus and Context Revisited. Oxford University Press, 2008. 7

[61]

A Comprehensive Overview of Large Language Models

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023. 2

[62]

Spatial Cognition

Nora S. Newcombe. Spatial Cognition. MIT Press, 2024. https://oecs.mit.edu/pub/or750iar. 2, 7

[63]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 8

[64]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. 2, 8

[65]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...

[66]

On measuring faithfulness or self-consistency of natural language explanations

Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In ACL, 2024. 5

[67]

Improving language understanding by generative pre-training

Alec Radford. Improving language understanding by generative pre-training. OpenAI Blog, 2018. 2, 8

[68]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

[69]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8

[70]

Does spatial cognition emerge in frontier models?

Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? arXiv preprint arXiv:2410.06468, 2024. 8

[71]

"Why should I trust you?" Explaining the predictions of any classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016. 5

[72]

Grounding natural language instructions: Can large language models capture spatial information?

Julia Rozanova, Deborah Ferreira, Krishna Dubba, Weiwei Cheng, Dell Zhang, and Andre Freitas. Grounding natural language instructions: Can large language models capture spatial information? arXiv preprint arXiv:2109.08634,

[73]

Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., USA, 1986. 4

[74]

Mental rotation: effects of dimensionality of objects and type of task

Shenna Shepard and Douglas Metzler. Mental rotation: effects of dimensionality of objects and type of task. Journal of experimental psychology: Human perception and performance, 14(1):3, 1988. 2

[75]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In ICCV, 2023. 6

[76]

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, and Jinhua Zhao. Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning. arXiv preprint arXiv:2410.16162, 2024. 8

[77]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 8

[78]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2, 4, 5, 8, 17

[79]

Drivevlm: The convergence of autonomous driving and large vision-language models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024. 2

[80]

E. C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. 2, 7

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 2

[2]

Working memory

Alan Baddeley. Working memory. Science, 255(5044): 556–559, 1992. 2

[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 2, 8

[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966,

[5]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In NeurIPS,

[6]

Large linguistic models: Analyzing theoretical linguistic abilities of llms

Gašper Beguš, Maksymilian Dąbkowski, and Ryan Rhodes. Large linguistic models: Analyzing theoretical linguistic abilities of llms. arXiv preprint arXiv:2305.00948, 2023. 2

[7]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023. 2

[8]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In RSS, 2023. 2

[9]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

[10]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024. 8

[11]

Spatial and object visualization cognitive styles: Validation studies in 3800 individuals

Christopher F Chabris, Thomas E Jerde, Anita W Woolley, Margaret E Gerbasi, Jonathon P Schuldt, Sean L Bennett, J Richard Hackman, and Stephen M Kosslyn. Spatial and object visualization cognitive styles: Validation studies in 3800 individuals. Group brain technical report, 2:1–20,

[12]

Hourvideo: 1-hour video-language understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. Hourvideo: 1-hour video-language understanding. In NeurIPS, 2024. 2, 17

[13]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024. 8

[14]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 4

[15]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 2

[16]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024. 8

[17]

Spatially-aware transformers for embodied agents

Junmo Cho, Jaesik Yoon, and Sungjin Ahn. Spatially-aware transformers for embodied agents. In ICLR, 2023. 8

[18]

Dual coding theory and education

James M. Clark and Allan Paivio. Dual coding theory and education. Educational Psychology Review, 3(3):149–210,

[19]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017. 3, 13

[20]

Milton J. Dehn. Working Memory and Academic Learning: Assessment and Intervention. John Wiley & Sons, 2011. 3

[21]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 2, 8

[22]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010. 4

[23]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. In NeurIPS, 2024. 8

[24]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 4, 7, 8, 16

[25] [25]

Self- explanation prompting improves dialogue understanding in large language models

Haoyu Gao, Ting-En Lin, Hangyu Li, Min Yang, Yuchuan Wu, Wentao Ma, Fei Huang, and Yongbin Li. Self- explanation prompting improves dialogue understanding in large language models. In COLING, 2024. 5

work page 2024

[26] [26]

Frames of Mind: The Theory of Multi- ple Intelligences

Howard Gardner. Frames of Mind: The Theory of Multi- ple Intelligences. Basic Books, tenth-anniversary edition, second paperback edition edition, 1983. 2

work page 1983

[27] [27]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022. 2

work page 2022

[28] [28]

A real-world webagent with planning, long context under- standing, and program synthesis

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Saf- dari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context under- standing, and program synthesis. In ICLR, 2024. 2

work page 2024

[29] [29]

Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. arXiv preprint arXiv:2411.14794, 2024. 8

work page arXiv 2024

[30] [30]

Masked autoencoders are scal- able vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scal- able vision learners. In CVPR, 2022. 8

work page 2022

[31] [31]

Mea- suring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In ICLR,

work page

[32] [32]

Can large language models explain themselves? a study of llm-generated self- explanations

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H Gilpin. Can large language models explain themselves? a study of llm-generated self- explanations. arXiv preprint arXiv:2310.11207, 2023. 5

work page arXiv 2023

[33] [33]

Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Ex- tracting actionable knowledge for embodied agents. In ICML, 2022. 2, 6

work page 2022

[34] [34]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2, 4, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In ICLR, 2024. 2, 6

work page 2024

[36] [36]

Language models with rationality

Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze, and Peter Clark. Language models with rationality. In EMNLP, 2023. 2

work page 2023

[37] [37]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. In CoRL,

work page

[38] [38]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 7, 15

work page 2022

[39] [39]

Seed-bench: Bench- marking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Bench- marking multimodal large language models. In CVPR,

work page

[40] [40]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 4, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Topviewrs: Vision-language models as top-view spatial reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vuli ´c. Topviewrs: Vision-language models as top-view spatial reasoners. arXiv preprint arXiv:2406.02537, 2024. 8

work page arXiv 2024

[42] [42]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 2

work page 2023

[43] [43]

Mvbench: A comprehensive multi-modal video under- standing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. In CVPR, 2024. 8

work page 2024

[44] [44]

Vitatecs: A diag- nostic dataset for temporal concept understanding of video- language models

Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diag- nostic dataset for temporal concept understanding of video- language models. arXiv preprint arXiv:2311.17404, 2023. 8

work page arXiv 2023

[45] [45]

Vila: On pre-training for vi- sual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. In CVPR, 2024. 4

work page 2024

[46] [46]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

work page 2014

[47] [47]

Multi- modal situated reasoning in 3d scenes

Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiao- jian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi- modal situated reasoning in 3d scenes. Advances in Neu- ral Information Processing Systems , 37:140903–140936,

work page

[48] [48]

Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model

Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yan- song Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in mul- timodal language model. arXiv preprint arXiv:2408.00754,

work page arXiv

[49] [49]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 2, 17

work page 2024

[50] [50]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Temp- Compass: Do video LLMs really understand videos? In Findings of ACL, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Temp- Compass: Do video LLMs really understand videos? In Findings of ACL, 2024. 8

work page 2024

[52] [52]

Mmbench: Is your multi- modal model an all-around player? In ECCV, 2025

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi- modal model an all-around player? In ECCV, 2025. 8

work page 2025

[53] [53]

Faithful chain-of-thought reasoning

Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison- Burch. Faithful chain-of-thought reasoning. In ACL, 2023. 5

work page 2023

[54] [54]

Openeqa: Embodied question answering in the era of foun- dation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. Openeqa: Embodied question answering in the era of foun- dation models. In CVPR, 2024. 8, 17

work page 2024

[55] [55]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. NeurIPS, 2023. 2, 8, 16

work page 2023

[56] [56]

Julia McAfoose and Bernhard T. Baune. Exploring vi- sual–spatial working memory: A critical review of concepts and models. Neuropsychology Review, 2009. 3

work page 2009

[57] [57]

Individual dif- ferences in navigation: an introductory overview

Chiara Meneghetti, Laura Miola, Tommaso Feraco, Veron- ica Muffato, and Tommaso Feraco Miola. Individual dif- ferences in navigation: an introductory overview. Prime archives in psychology, 2022. 2

work page 2022

[58] [58]

Evaluating cognitive maps and planning in large language models with cogeval

Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Fru- jeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, and Jonathan Larson. Evaluating cognitive maps and planning in large language models with cogeval. NeurIPS,

work page

[59] [59]

Embodiedgpt: Vision-language pre-training via embodied chain of thought

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. NeurIPS, 2024. 8

work page 2024

[60] [60]

The Hippocampus and Context Revisited

Lynn Nadel. The Hippocampus and Context Revisited. Ox- ford University Press, 2008. 7

work page 2008

[61] [61]

A Comprehensive Overview of Large Language Models

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muham- mad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehen- sive overview of large language models. arXiv preprint arXiv:2307.06435, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[62] [62]

Newcombe

Nora S. Newcombe. Spatial Cognition. MIT Press, 2024. https://oecs.mit.edu/pub/or750iar. 2, 7

work page 2024

[63]

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 8

[64]

Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023. 2, 8

[65]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...

[66]

Letitia Parcalabescu and Anette Frank. On measuring faithfulness or self-consistency of natural language explanations. In ACL, 2024. 5

[67]

Alec Radford. Improving language understanding by generative pre-training. OpenAI Blog, 2018. 2, 8

[68]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

[69]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8

[70]

Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? arXiv preprint arXiv:2410.06468, 2024. 8

[71]

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016. 5

[72]

Julia Rozanova, Deborah Ferreira, Krishna Dubba, Weiwei Cheng, Dell Zhang, and Andre Freitas. Grounding natural language instructions: Can large language models capture spatial information? arXiv preprint arXiv:2109.08634,

[73]

Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., USA, 1986. 4

[74]

Shenna Shepard and Douglas Metzler. Mental rotation: effects of dimensionality of objects and type of task. Journal of experimental psychology: Human perception and performance, 14(1):3, 1988. 2

[75]

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In ICCV, 2023. 6

[76]

Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, and Jinhua Zhao. Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning. arXiv preprint arXiv:2410.16162, 2024. 8

[77]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 8

[78]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2, 4, 5, 8, 17

[79]

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. In CoRL, 2024. 2

[80]

E. C. Tolman. Cognitive maps in rats and men. Psychological Review, 55(4):189–208, 1948. 2, 7