SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
Pith reviewed 2026-05-10 01:20 UTC · model grok-4.3
The pith
Reformulating visual self-supervised tasks as verifiable puzzles supplies automatic rewards for reinforcement learning post-training of multimodal language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SSL-R1 reformulates widely used visual self-supervised tasks into a collection of verifiable visual puzzles. These puzzles generate rewards directly from image data for RL post-training of MLLMs, requiring neither human annotations nor external model supervision. Models trained under this regime show substantial gains on multimodal understanding and reasoning benchmarks.
What carries the argument
Reformulation of visual SSL tasks into verifiable puzzles that yield image-derived rewards for reinforcement learning.
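The reformulation can be made concrete with rotation prediction, one of the classic SSL tasks: the puzzle constructor applies a known transformation, so the reward is verifiable by construction. A minimal illustrative sketch (the function names and the binary-reward choice are assumptions, not the paper's implementation):

```python
import random

ROTATIONS = [0, 90, 180, 270]  # candidate angles, multiples of 90 degrees

def rotate90(img, k):
    """Rotate a 2D grid (list of rows) clockwise by k quarter turns."""
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img

def make_rotation_puzzle(img, rng=random):
    """Apply a random rotation to an unlabeled image.

    The applied angle is the ground-truth answer, known by construction,
    so no human annotation or external model is needed to grade it.
    """
    angle = rng.choice(ROTATIONS)
    return rotate90(img, angle // 90), angle

def verify(predicted_angle, true_angle):
    """Binary verifiable reward for the RL post-training stage."""
    return 1.0 if predicted_angle == true_angle else 0.0
```

In an RL loop the MLLM would be shown the rotated image, asked for the angle in a fixed answer format, and rewarded by `verify`; the same recipe extends to any SSL task whose transformation is known to the puzzle constructor.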
If this is right
- MLLMs exhibit measurable gains on multimodal understanding and reasoning benchmarks after training on the visual puzzles.
- RL post-training becomes feasible at larger scales because rewards no longer require human or external model supervision.
- Vision-centric self-supervised signals can be used to counteract language-centric biases in MLLM training.
- The framework offers concrete design guidance for devising additional self-supervised verifiable rewards.
Where Pith is reading between the lines
- The method could be extended to video or 3D data by turning temporal or geometric SSL tasks into similarly verifiable puzzles.
- Combining these visual rewards with existing language-based RLVR signals might produce hybrid training regimes that balance modalities more evenly.
- Models refined this way may display improved transfer to downstream visual tasks that were never used as training puzzles.
Load-bearing premise
Rewards obtained by solving these visual puzzles will strengthen the model's general visual understanding and reasoning instead of merely teaching it to solve the specific puzzles.
What would settle it
Train an MLLM with SSL-R1 and measure its accuracy on held-out multimodal benchmarks such as visual question answering or reasoning tasks; no improvement or a drop relative to a standard baseline would falsify the central claim.
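That falsification test can be phrased as a simple decision rule; a hypothetical sketch, where the benchmark names and the all-benchmarks aggregation are placeholders rather than the paper's protocol:

```python
def claim_survives(baseline, ssl_r1, min_gain=0.0):
    """Compare per-benchmark accuracy of an SSL-R1-trained model against
    the same model without the puzzle RL stage. Any benchmark where the
    trained model fails to improve counts against the central claim."""
    gains = {b: ssl_r1[b] - baseline[b] for b in baseline}
    return all(g > min_gain for g in gains.values()), gains
```

For example, `claim_survives({"vqa": 0.61}, {"vqa": 0.66})` reports a surviving claim with a 5-point gain, while identical baseline and post-training scores falsify it.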
Original abstract
Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SSL-R1, a self-supervised RL post-training framework for MLLMs. It reformulates standard visual SSL tasks (e.g., rotation prediction, jigsaw) into verifiable image-based puzzles that generate rewards directly from intrinsic image properties, without human annotations or external models. The central claim is that training MLLMs via RL on these tasks substantially improves performance on multimodal understanding and reasoning benchmarks.
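The jigsaw reformulation named in the summary follows the same pattern: the constructor shuffles image patches with a permutation it keeps, so a predicted ordering can be graded exactly. A hedged sketch (the partial-credit reward is an assumption; the paper may grade answers differently):

```python
import random

def make_jigsaw_puzzle(patches, rng=random):
    """Shuffle patches so that output position i holds patches[order[i]].

    The permutation `order` is the verifiable ground truth, produced as a
    byproduct of constructing the puzzle from an unlabeled image.
    """
    order = list(range(len(patches)))
    rng.shuffle(order)
    return [patches[i] for i in order], order

def jigsaw_reward(predicted, truth):
    """Fraction of positions the model places correctly (partial credit)."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

Partial credit gives the policy a smoother reward surface than an all-or-nothing match, which can matter early in RL training when exact permutations are rarely sampled; whether SSL-R1 uses graded or exact-match rewards is not stated in the provided text.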
Significance. If the gains prove generalizable rather than puzzle-specific, the work would meaningfully advance scalable RLVR for MLLMs by shifting reward design to vision-centric self-supervision. The approach correctly identifies the annotation bottleneck in prior RLVR methods and proposes a concrete alternative using existing SSL primitives.
Major comments (1)
- [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. The skeptic's concern is therefore a correctness risk that requires a concrete control experiment.
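The RL stage the referee targets typically optimizes a group-relative objective in this line of work (GRPO, introduced in DeepSeekMath); the signal that distinguishes it from SFT on the same puzzle data is an advantage computed from sampled answers' verifiable rewards. A sketch under that assumption:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one puzzle: each sampled answer's
    verifiable reward is normalized against its own sample group,
    so no learned value model is required."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

SFT on identical puzzle instances would instead apply a cross-entropy loss on the single ground-truth answer; the requested ablation isolates whether this sampled, reward-weighted signal, rather than mere exposure to the puzzle format, is what transfers to held-out benchmarks.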
Minor comments (1)
- [Abstract] Abstract: The phrase 'widely-used SSL tasks' should explicitly name the tasks (rotation, jigsaw, inpainting, etc.) and the exact reformulation into question-answer format for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work and for the constructive major comment. We agree that the requested ablation is necessary to strengthen the central claim and will incorporate it in the revised manuscript.
Point-by-point responses
Referee: [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. The skeptic's concern is therefore a correctness risk that requires a concrete control experiment.
Authors: We agree that this control experiment is essential to rule out the possibility that gains arise merely from exposure to the puzzle formats rather than from the RL optimization itself. In the revised manuscript we will add a direct comparison of SSL-R1 (RL post-training with verifiable rewards) against supervised fine-tuning on the identical set of puzzle instances, using the same verifiable ground-truth answers as supervision targets. This ablation will be reported alongside the existing results in the Experiments section, with details on training hyperparameters and evaluation to ensure fair comparison. We believe the new results will further substantiate that the RL stage yields transferable visual reasoning improvements beyond supervised task exposure.
Revision: yes
Circularity Check
No circularity in the claimed derivation chain
Full rationale
The paper introduces SSL-R1 by reformulating standard visual SSL tasks (e.g., rotation, jigsaw) into verifiable image-based puzzles whose rewards are computed directly from intrinsic image properties without human or external model labels. This reformulation and the subsequent RL post-training step constitute an independent methodological contribution; benchmark gains are reported as empirical outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No load-bearing uniqueness theorems, ansatzes, or self-referential definitions appear in the provided text, and the central premise does not reduce to renaming or tautological prediction of its own inputs.