Recognition: 2 theorem links
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
Pith reviewed 2026-05-15 15:12 UTC · model grok-4.3
The pith
SpaceR uses RL with a map imagination step to lift open MLLMs above GPT-4o on video spatial reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpaceR achieves state-of-the-art performance on the spatial reasoning benchmarks VSI-Bench, STI-Bench, and SPAR-Bench by training on the SpaceR-151k dataset with Spatially-Guided RLVR, which extends GRPO with a novel map imagination mechanism that encourages the model to infer spatial layouts in its thinking process. At the same time, it maintains competitive results on video understanding benchmarks such as Video-MME, TempCompass, and LongVideoBench.
What carries the argument
Spatially-Guided RLVR (SG-RLVR), which extends Group Relative Policy Optimization by inserting a map imagination mechanism that prompts the model to construct spatial layouts during the reasoning trace before producing an answer.
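The GRPO backbone that SG-RLVR extends scores a group of sampled responses with a verifiable reward and normalizes rewards within the group, so no learned value model is needed. A minimal sketch of that loop, where the reward shape, the 0.2 bonus weight, and the `<map>` tag convention are illustrative assumptions rather than the paper's actual implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def sg_rlvr_reward(response, gold_answer, map_bonus=0.2):
    """Illustrative SG-RLVR-style reward: verifiable answer accuracy plus a
    small bonus when the trace contains an explicit spatial-layout block.
    The <map>...</map> convention and the 0.2 weight are assumptions."""
    accuracy = 1.0 if response["answer"] == gold_answer else 0.0
    has_map = "<map>" in response["thinking"] and "</map>" in response["thinking"]
    return accuracy + (map_bonus if has_map else 0.0)

# One group of four sampled responses to the same spatial question.
group = [
    {"thinking": "<map>desk | bed | window</map>", "answer": "left"},
    {"thinking": "no layout sketched",             "answer": "left"},
    {"thinking": "<map>desk | bed | window</map>", "answer": "right"},
    {"thinking": "guessing",                       "answer": "right"},
]
rewards = [sg_rlvr_reward(r, gold_answer="left") for r in group]
advantages = group_relative_advantages(rewards)
```

Under this shaping, a correct answer with an explicit layout block gets the highest within-group advantage, which is how the mechanism is supposed to steer the policy toward sketching layouts before answering.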
If this is right
- SpaceR surpasses GPT-4o by 11.6% accuracy on VSI-Bench.
- SpaceR reaches performance on par with Gemini-2.0-Flash on VSI-Bench.
- State-of-the-art results hold on STI-Bench and SPAR-Bench.
- General video understanding performance stays competitive on Video-MME, TempCompass, and LongVideoBench.
Where Pith is reading between the lines
- Map imagination guidance could be tested on image-only spatial tasks or on temporal reasoning to see if the same RL structure transfers.
- Specialized verifiable-reward RL loops may narrow the gap between open and proprietary MLLMs on other narrow reasoning skills.
- Independent spatial benchmarks created after model release would provide a stronger test of whether the gains reflect genuine layout understanding.
Load-bearing premise
The map imagination mechanism inside SG-RLVR genuinely improves spatial reasoning rather than merely increasing the chance of producing benchmark-correct answers during RL training.
What would settle it
An ablation that removes the map imagination step from SG-RLVR while keeping the rest of the training procedure would show whether the accuracy gains on VSI-Bench and similar benchmarks disappear.
original abstract
Video spatial reasoning, which involves inferring the underlying spatial structure from observed video frames, poses a significant challenge for existing Multimodal Large Language Models (MLLMs). This limitation stems primarily from 1) the absence of high-quality datasets for this task, and 2) the lack of effective training strategies to develop spatial reasoning capabilities. Motivated by the success of Reinforcement Learning with Verifiable Reward (RLVR) in unlocking LLM reasoning abilities, this work aims to improve MLLMs in video spatial reasoning through the RLVR paradigm. To this end, we introduce the $\textbf{SpaceR}$ framework. First, we present $\textbf{SpaceR-151k}$, a dataset with 91k questions spanning diverse spatial reasoning scenarios with verifiable answers, and 60k samples for maintaining general multimodal understanding. Second, we propose $\textbf{Spatially-Guided RLVR (SG-RLVR)}$, a novel reinforcement learning approach that extends Group Relative Policy Optimization (GRPO) with a novel map imagination mechanism, which encourages the model to infer spatial layouts in the thinking process, thereby facilitating more effective spatial reasoning. Extensive experiments demonstrate that SpaceR achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., VSI-Bench, STI-Bench, and SPAR-Bench), while maintaining competitive results on video understanding benchmarks (e.g., Video-MME, TempCompass, and LongVideoBench). Remarkably, SpaceR surpasses the advanced GPT-4o by 11.6\% accuracy on VSI-Bench and is on par with the leading proprietary model Gemini-2.0-Flash, highlighting the effectiveness of our SpaceR-151k dataset and SG-RLVR in reinforcing spatial reasoning ability of MLLMs. Code, model, and dataset are available at https://github.com/OuyangKun10/SpaceR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the SpaceR framework to enhance MLLMs for video spatial reasoning. It contributes the SpaceR-151k dataset (91k verifiable spatial questions plus 60k general samples) and proposes Spatially-Guided RLVR (SG-RLVR), which extends GRPO by adding a map imagination step that encourages explicit spatial layout inference before answer generation. Experiments report SOTA results on VSI-Bench, STI-Bench, and SPAR-Bench, with SpaceR surpassing GPT-4o by 11.6% accuracy on VSI-Bench while remaining competitive on general video benchmarks such as Video-MME.
Significance. If the performance gains prove attributable to the map imagination mechanism rather than generic RL effects or dataset scale, the work would meaningfully advance open-source MLLM spatial reasoning and reduce reliance on proprietary models. The public release of code, model, and dataset supports reproducibility and follow-on research.
major comments (2)
- [Experiments] Experiments section: No ablation is reported that trains standard GRPO versus SG-RLVR on the identical SpaceR-151k data. Without this control, the 11.6% VSI-Bench gain cannot be confidently attributed to the map imagination mechanism rather than reward optimization or data effects alone.
- [Results] Results tables (VSI-Bench, STI-Bench): Baseline implementations for GPT-4o and Gemini-2.0-Flash are not detailed, nor are variance estimates or statistical significance tests provided for the reported accuracy differences, weakening the strength of the SOTA claim.
minor comments (1)
- [Abstract] The abstract and method description could more explicitly separate the contributions of the new dataset from those of the SG-RLVR algorithm.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate the suggested revisions to strengthen the attribution of results and the transparency of our evaluations.
point-by-point responses
- Referee: [Experiments] Experiments section: No ablation is reported that trains standard GRPO versus SG-RLVR on the identical SpaceR-151k data. Without this control, the 11.6% VSI-Bench gain cannot be confidently attributed to the map imagination mechanism rather than reward optimization or data effects alone.
  Authors: We agree that this control experiment is essential to isolate the contribution of the map imagination step. In the revised manuscript we will add an ablation that trains the base model with standard GRPO on the exact same SpaceR-151k dataset and directly compares its performance against SG-RLVR on VSI-Bench, STI-Bench, and SPAR-Bench. This will allow readers to attribute gains more confidently to the spatial guidance mechanism.
  revision: yes
- Referee: [Results] Results tables (VSI-Bench, STI-Bench): Baseline implementations for GPT-4o and Gemini-2.0-Flash are not detailed, nor are variance estimates or statistical significance tests provided for the reported accuracy differences, weakening the strength of the SOTA claim.
  Authors: We acknowledge the need for greater transparency. In the revision we will expand the experimental details section to describe the exact prompts, input formatting, and inference settings used for GPT-4o and Gemini-2.0-Flash. We will also report standard deviations across multiple runs and include statistical significance tests (e.g., paired t-tests) for the key accuracy differences on VSI-Bench and STI-Bench.
  revision: yes
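The promised significance testing is simple to sketch. Below is a pure-Python paired t statistic over per-question correctness vectors; the toy score vectors are invented for illustration, and for binary correctness a McNemar test would arguably be the more standard choice than the t-test the authors name:

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic for two models scored on the same questions
    (1 = correct, 0 = wrong). Returns (t, degrees_of_freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1

# Invented per-question correctness for two models on 8 shared questions.
model_a = [1, 1, 1, 0, 1, 1, 0, 1]
model_b = [1, 0, 0, 0, 1, 0, 0, 1]
t_stat, dof = paired_t(model_a, model_b)
```

Pairing on questions matters here: accuracy differences on a shared question set are correlated, so an unpaired comparison would overstate the variance and understate significance.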
Circularity Check
No circularity: empirical results on external benchmarks with independent evaluation
full rationale
The paper introduces SpaceR-151k dataset and SG-RLVR training (extending GRPO with map imagination) then reports accuracy gains on held-out benchmarks VSI-Bench, STI-Bench, SPAR-Bench, Video-MME etc. No equations, fitted parameters, or self-citations reduce the headline 11.6% improvement to a quantity defined by the training data itself. The map imagination step is a proposed mechanism whose benefit is measured by downstream benchmark scores rather than by construction or renaming of inputs. All load-bearing claims rest on external test sets and standard RLVR training, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- domain assumption: Verifiable rewards can be reliably assigned to spatial reasoning questions in video
invented entities (1)
- map imagination mechanism (no independent evidence)
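The single domain axiom in the ledger is what makes the whole RLVR loop possible: every question type must admit an automatic checker. A toy sketch of such a verifier, where the exact-match rule and the 10% relative tolerance for numeric answers are illustrative assumptions, not the benchmarks' actual scoring metrics:

```python
def verifiable_reward(pred, gold, numeric=False, rel_tol=0.10):
    """Toy verifier: exact match for categorical answers (e.g. left/right),
    relative-error tolerance for numeric estimates (distances, room sizes).
    The 10% tolerance is illustrative, not VSI-Bench's scoring rule."""
    if not numeric:
        return 1.0 if str(pred).strip().lower() == str(gold).strip().lower() else 0.0
    try:
        p, g = float(pred), float(gold)
    except (TypeError, ValueError):
        return 0.0  # unparseable numeric prediction earns no reward
    return 1.0 if abs(p - g) <= rel_tol * abs(g) else 0.0
```

Whether such checkers are reliable enough in practice (e.g., for metric distance estimates from monocular video) is exactly what the axiom assumes rather than proves.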
Lean theorems connected to this paper
- DimensionForcing.linking_requires_D3 (unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "SpaceR surpasses the advanced GPT-4o by 11.6% accuracy on VSI-Bench"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Count Anything at Any Granularity
  Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Token Warping Helps MLLMs Look from Nearby Viewpoints
  Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
- Motion-o: Trajectory-Grounded Video Reasoning
  Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
- SCP: Spatial Causal Prediction in Video
  SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
  SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
- Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
  Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D...
- 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
  4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
  GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
  EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
- Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
  Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
- From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
  SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
  MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
  OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
- Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
  A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
- Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Reference graph
Works this paper leans on
- [1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- [3] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271.
- [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198.
- [5] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839.
- [7] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488.
- [8]
- [9] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. 2025. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776.
- [10] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. 2024. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075.
- [11] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275.
- [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- [14] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720.
- [15] Yang Jin, Zehuan Yuan, Yadong Mu, et al. 2022. Embracing consistency: A one-stage approach for spatio-temporal video grounding. Advances in Neural Information Processing Systems, 35:29192–29204.
- [16] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
- [17]
- [18]
- [19] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- [20] Kun-Yu Lin, Jia-Run Du, Yipeng Gao, Jiaming Zhou, and Wei-Shi Zheng. 2023. Diversifying spatial-temporal perception for video domain generalization. Advances in Neural Information Processing Systems, 36:56012–56026.
- [21]
- [22] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. 2025. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982.
- [23] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. 2024. Tempcompass: Do video llms really understand videos? In Findings of the Association for Computational Linguistics ACL 2024, pages 8731–8772.
- [24] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785.
- [26] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. 2023. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761.
- [27]
- [28] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649.
- [29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252.
- [30] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [31] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
- [32] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. 2025. Kimi-vl technical report. arXiv preprint arXiv:2504.07491.
- [33] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1494–1504.
- [34] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. 2024. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857.
- [35] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [36]
- [37] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800.
- [38] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer.
- [39] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986.
- [40] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. 2025. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106.
- [41] Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. 2025. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976.
Question templates for QA pair construction (Figure 6, appendix A, "More Details for Data Construction")
- Relative Distance: Measuring from the closest point of each o...
- Relative Direction: If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to my left, right, or back? An object is to my back if I would have to turn at least 135 degrees in order to face it.
- Relative Direction (quadrants): If I am standing by the {positioning object} and facing the {orienting object}, is the {querying object} to my front-left, front-right, back-left, or back-right? Directions refer to the quadrants of a Cartesian plane (assuming I am at the origin and facing the positive y-axis).
- Appearance Order: What will be the first-time appearance order of the following...
- Object Size: What is the length of the longest dimension (length, width, or height) of the {object}, measured in centimeters?
- Room Size: What is the size of this room (in square meters)? If multiple rooms are shown, estimate the size of the combined space.
- Absolute Distance: Measuring from the closest point of each object, what is the direct distance between the {object 1} and the {object 2} (in meters)?
- Counting: How many {object}(s) are in this room?
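The question templates are plain strings with slotted placeholders; because the slot names contain spaces, a simple substitution loop is safer than relying on str.format. A sketch with invented scene objects:

```python
def fill_template(template, slots):
    """Substitute {slot name} placeholders; slot names may contain spaces."""
    for name, value in slots.items():
        template = template.replace("{" + name + "}", value)
    return template

template = ("If I am standing by the {positioning object} and facing the "
            "{orienting object}, is the {querying object} to my left, right, or back?")
question = fill_template(template, {
    "positioning object": "sofa",
    "orienting object": "television",
    "querying object": "bookshelf",
})
```

Because the answer for each instantiated question can be computed from scene geometry, this template scheme is what makes the rewards verifiable in the RLVR sense.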