Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
Pith reviewed 2026-05-20 11:29 UTC · model grok-4.3
The pith
SAGE enforces logical consistency across geometric transformations to strengthen spatial reasoning in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating logical consistency under geometric and linguistic duality as an auxiliary reward signal inside GRPO training and maintaining a dynamic operation pool that retires mastered transformations while introducing harder ones, the model learns to produce answers that remain coherent across original and transformed inputs, yielding consistent gains over baselines on spatial and video reasoning tasks together with improved generalization.
What carries the argument
The duality consistency reward within GRPO training, driven by a dynamic operation pool that continuously probes for logical inconsistencies using geometric and linguistic transformations.
If this is right
- VLMs exhibit improved accuracy on spatial reasoning benchmarks after applying SAGE.
- The method enhances generalization to data not seen during training.
- SAGE can be used as a lightweight post-training stage on any existing vision-language model.
- Training becomes more data-efficient than previous GRPO approaches by focusing on informative inconsistencies.
- The dynamic pool ensures ongoing focus on challenging operations rather than static ones.
Where Pith is reading between the lines
- This consistency enforcement could be adapted to improve other forms of reasoning such as temporal or causal inference in multimodal models.
- Models trained this way might show greater robustness when deployed in real-world environments with natural variations in viewpoint or description.
- Combining SAGE with other alignment techniques could produce even stronger spatial capabilities.
- Future tests might apply the same duality idea to non-spatial domains to check if the principle is general.
Load-bearing premise
That training for consistency under the probed geometric and linguistic transformations produces deep spatial understanding rather than pattern-matching to the specific operations used.
What would settle it
A test set of novel geometric transformations outside the dynamic operation pool on which the SAGE-trained model shows no improvement or even reduced performance compared to the baseline.
Figures
read the original abstract
Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SAGE (Spatial Alignment via Geometric Evolution), a self-evolving framework to improve spatial reasoning in Vision-Language Models. It adds duality consistency (across geometric and linguistic transformations) as an auxiliary reward inside GRPO training and maintains a dynamic operation pool that surfaces inconsistencies and retires mastered operations. The method is presented as model-agnostic and data-efficient; experiments on video and spatial-reasoning benchmarks are claimed to show gains over strong baselines together with better generalization to unseen data.
Significance. If the empirical claims hold, the work supplies a lightweight post-training recipe that could raise the logical robustness of existing VLMs on spatial tasks without large-scale data collection. The dynamic-pool mechanism for focusing training effort is a concrete engineering contribution that merits attention if it demonstrably avoids superficial consistency.
major comments (3)
- [§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.
- [§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.
- [§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.
minor comments (2)
- [Abstract] Abstract: replace the qualitative phrase “consistent improvements over strong baselines” with at least one concrete metric and benchmark name.
- [§3] Notation: define the geometric and linguistic duality operators with a short illustrative example (e.g., rotation + linguistic negation) before the formal reward equation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity, rigor, and evidence that we will address in the revised manuscript. Below we respond point by point to the major comments.
read point-by-point responses
-
Referee: [§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.
Authors: We agree that explicit clarification is needed to rule out any perception of circularity. The expected answer mappings in SAGE are derived from predefined geometric and linguistic transformation rules (e.g., explicit spatial relation updates under rotation or translation) that are constructed independently of model outputs and follow deterministic logic. These rules serve as the ground truth for the duality consistency reward. To eliminate ambiguity, we will revise §4 to include a dedicated subsection detailing the construction and independent verification of these mappings, with concrete examples for each operation type. revision: yes
-
Referee: [§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.
Authors: We acknowledge that the current generalization claims would benefit from stronger controls. While the dynamic pool mechanism is intended to prioritize inconsistent operations and retire mastered ones, the manuscript does not yet present held-out transformation families or a direct static-pool ablation. We will add these experiments: (1) a set of held-out transformation families excluded from all pool updates and training, and (2) a quantitative comparison against a static-pool baseline. The new results and analysis will be incorporated into §§5.2–5.3 to better substantiate the generalization claims. revision: yes
-
Referee: [§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.
Authors: We agree that the results section requires more precise reporting to allow proper evaluation. We will revise §6 (and update the abstract accordingly) to report explicit numerical deltas for all improvements, clearly name all baselines, include statistical significance tests, and add a dedicated ablation table isolating the contribution of the dynamic operation pool. These changes will make effect sizes and reliability transparent. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The SAGE method defines duality consistency as an auxiliary reward signal within GRPO training, where geometric and linguistic transformations are applied to inputs and consistency is encouraged across original/transformed pairs. The dynamic operation pool selects based on detected inconsistencies from model outputs, but this constitutes a standard self-supervised RL loop with externally specified transformation rules rather than a reduction of the central claim to its own fitted outputs by construction. No equations or steps in the abstract or description equate a 'prediction' directly to an input parameter or rename a fit as a first-principles result. Evaluation relies on external video and spatial reasoning benchmarks, making the framework self-contained against independent test distributions. No self-citation chains or uniqueness theorems are invoked as load-bearing in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Enforcing duality consistency between geometric transformations and linguistic answers improves robust spatial reasoning in VLMs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Has- son, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´...
work page 2022
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,
-
[4]
URLhttps://arxiv.org/abs/2502.13923
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GPKTIktA0k
work page 2024
-
[6]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In 2https://www.anthropic.com/news/claude-opus-4-6 10 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024
work page 2024
-
[7]
Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas
Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=k7vcuqLK4X
work page 2025
-
[8]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Spatialrgpt: Grounded spatial reasoning in vision- language models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision- language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 135062–135093. Curran...
work page 2024
-
[10]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...
work page 2017
-
[11]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B. Chan. Viss-r1: Self-supervised reinforcement video reasoning, 2025. URL https://arxiv.org/abs/2511. 13054
work page 2025
-
[13]
Video-r1: Reinforcing video reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
-
[14]
URLhttps://openreview.net/forum?id=a2JTVVvcEl
-
[15]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...
work page 2025
-
[16]
Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,
-
[17]
URLhttps://arxiv.org/abs/2501.13826
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InThe Fourteenth International Conference on Learning Representations,
-
[19]
URLhttps://openreview.net/forum?id=6nZKT2rL0H. 11
-
[20]
Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, and Tanuja Ganu. Faithful grpo: Improving visual spatial reasoning in multimodal language models via constrained policy optimization, 2026. URLhttps://arxiv.org/abs/2604.08476
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Spatialevo: Self-evolving spatial intelligence via deterministic geometric environments, 2026. URL https: //a...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial- bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500
-
[23]
Spatialladder: Progressive training for spatial reasoning in vision-language models
Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KtrFXlvgrK
work page 2026
-
[24]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...
work page 2022
-
[25]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, June 2024
work page 2024
-
[26]
OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding
Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum? id=vAkVKIOtcN
work page 2026
-
[27]
Transactions of the Association for Computational Linguistics , volume =
Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 06 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00566. URLhttps://doi.org/10.1162/tacl_a_00566
-
[28]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...
work page 2023
-
[29]
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, Bangkok, Thailand, August 2024. Association for Computatio...
-
[30]
Video-ChatGPT: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, Bangkok, Thailan...
work page 2024
-
[31]
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025. URL https://arxiv. org/abs/2504.01805
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...
work page 2022
-
[33]
Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse
Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=EdQzLC0Zra
work page 2026
-
[34]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...
work page 2021
-
[35]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Associ...
work page 2023
-
[36]
Vision- language models are zero-shot reward models for reinforcement learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision- language models are zero-shot reward models for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=N0I2RtD8je
work page 2024
-
[37]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
Ntu rgb+d: A large scale dataset for 3d human activity analysis
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
work page 2016
-
[39]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Video-xl: Extra-long vision language model for hour-scale video understanding
Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, June 2025
work page 2025
-
[41]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card, 2025. URLhttps://arxiv.org/abs/2601.03267
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Learning to summarize with human feedback
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc., 2...
work page 2020
-
[43]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/ 2403.05530
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
More thought, less accuracy? on the dual nature of rea- soning in vision-language models
Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Henry Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of rea- soning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=XpL5eqjCjF
work page 2026
-
[45]
Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fer- gus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. ...
work page 2024
-
[46]
VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4oYxzssbVg
work page 2025
-
[47]
SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization
Peiyao Wang and Haibin Ling. SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization. InThe First Workshop on Efficient Spatial Reasoning, 2026. URLhttps://openreview.net/forum?id=o2E8oa2frj
work page 2026
-
[48]
VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning
Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=3pORFyKzh1
work page 2025
-
[49]
Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B
Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-RTS: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in N...
-
[50]
URLhttps://aclanthology.org/2025.emnlp-main.1428/
work page 2025
-
[51]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...
work page 2022
-
[52]
Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id= RnXS7aK4rK
work page 2025
-
[54]
URLhttps://openreview.net/forum?id=yyWeSAsOhs
-
[55]
Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing
Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
-
[56]
URLhttps://openreview.net/forum?id=yyWeSAsOhs. 14
-
[57]
FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation
Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, and Jinqiao Wang. FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=FACJ0478oQ
work page 2025
-
[58]
Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie
Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, June 2025
work page 2025
-
[59]
Visionzip: Longer is better but not necessary in vision language models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, June 2025
work page 2025
-
[60]
Visionthink: Smart and efficient vision language model via reinforcement learning
Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=R6m6bNnmWm
work page 2025
-
[61]
MMSI- bench: A benchmark for multi-image spatial intelligence
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI- bench: A benchmark for multi-image spatial intelligence. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=gHRoX4vXm3
work page 2026
-
[62]
Spatial mental modeling from limited views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views. InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025. URL https:...
work page 2025
-
[63]
Fine-tuning large vision-language models as decision-making agents via reinforcement learning
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informati...
-
[64]
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistic...
-
[65]
From flatland to space: Teaching vision-language models to perceive and reason in 3d
Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,
-
[66]
URLhttps://openreview.net/forum?id=GzgPleFl8f
-
[67]
Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024. doi: 10.1109/TPAMI.2024.3369699
-
[68]
CoFFT: Chain of foresight-focus thought for visual language models
Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, and Mike Zheng Shou. CoFFT: Chain of foresight-focus thought for visual language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
-
[69]
URLhttps://openreview.net/forum?id=AQnjBIFCBQ. 15
-
[70]
Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, and Wenwu Zhu. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 11071–11080, New York, ...
-
[71]
Mmvu: Measuring expert-level multi-discipline video understanding
Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, Chengye Wang, Ziyao Shangguan, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (...
work page 2025
-
[72]
Towards learning a generalist model for embodied navigation
Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13624–13634, June 2024
work page 2024
-
[73]
International Journal of Computer Vision , author =
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, 2022. doi: 10.1007/s11263-022-01653-1. URLhttps://doi.org/10.1007/s11263-022-01653-1
-
[74]
Program-of-thought reveals LLM abstraction ceilings
Mike Zhou, Fenil Bardoliya, Vivek Gupta, and Dan Roth. Program-of-thought reveals LLM abstraction ceilings. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, pages 4911–4919, Rabat, Morocco, March 2026. Association for Computational Linguistics. ISBN 979-8-89176-386-
work page 2026
-
[75]
URL https://aclanthology.org/2026
doi: 10.18653/v1/2026.findings-eacl.257. URL https://aclanthology.org/2026. findings-eacl.257/
-
[76]
Open- drivevla: Towards end-to-end autonomous driving with large vision language action model
Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16):13782–13790, Mar
-
[77]
What is the action performed by the person in the video?
doi: 10.1609/aaai.v40i16.38386. URL https://ojs.aaai.org/index.php/AAAI/ article/view/38386. 16 A Pseudocode In this section, we present the pseudocode of SAGE, as shown in Algorithm 1. Algorithm 1Self-Evolving Consistency Training 1: Input:VLM Mθ, reference Mθref, dataset D, pool P, eval interval E, max active K, mastery thresholdτ 2:Initialize active se...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.