
arxiv: 2510.16416 · v4 · pith:URTZ6GV2new · submitted 2025-10-18 · 💻 cs.CV · cs.AI

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Pith reviewed 2026-05-21 20:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords self-supervised learning · reinforcement learning · vision-language models · intrinsic rewards · multimodal reasoning · fine-tuning

The pith

Self-supervised learning objectives can be reformulated as automatic reward signals for reinforcing vision-language model reasoning without external evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework called SSL4RL that uses self-supervised learning tasks to generate intrinsic rewards for reinforcement learning applied to vision-language models. This addresses the problem of unreliable or expensive reward mechanisms by turning tasks like predicting image rotations or reconstructing masked patches into dense, verifiable reward signals. Experiments demonstrate that this leads to substantial performance gains on both vision-centric tasks and more complex vision-language reasoning benchmarks. Ablations show that factors like task difficulty, model scale, and semantic alignment play important roles in determining success. The method also generalizes to improving graph learning performance.

Core claim

SSL4RL reformulates common self-supervised learning objectives into dense automatic reward signals that can be used for RL-based fine-tuning of vision-language models, eliminating the need for human preference data or AI evaluators and leading to improved results on reasoning benchmarks.
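
One way to make this claim concrete is to write the reward as an exact-match check against the label produced by the corruption itself. The notation below is an assumption introduced for illustration, since this page does not reproduce the paper's equations: $c$ is a corruption function, $z$ the corruption label it emits, $q$ the task prompt, $\pi_\theta$ the VLM policy, and $g$ a hypothetical parser that extracts the model's answer.

```latex
(\tilde{x}, z) = c(x), \qquad
y \sim \pi_\theta(\cdot \mid \tilde{x}, q), \qquad
r(y, z) = \mathbb{1}\left[\, g(y) = z \,\right]
```

Because $z$ is known by construction, a reward of this form needs neither a human label nor a learned judge. This is a binary sketch only; the paper's actual reward definition may differ.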

What carries the argument

Reformulation of SSL objectives such as image rotation prediction and masked patch reconstruction into dense, automatic reward signals for RL fine-tuning of VLMs.
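
A minimal sketch of how rotation prediction might be turned into a verifiable reward, assuming a binary exact-match rule and four right-angle rotations. The function names (`rotate90`, `make_rotation_task`, `rotation_reward`), the label set, and the answer-parsing regex are hypothetical choices for illustration, not the paper's implementation; images are plain nested lists to keep the example dependency-free.

```python
import random
import re

ANGLES = (0, 90, 180, 270)  # assumed label set, not taken from the paper


def rotate90(grid, k):
    """Rotate a 2D list clockwise by k quarter-turns."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def make_rotation_task(image, rng):
    """Corrupt the input and keep the corruption label as the target."""
    k = rng.randrange(len(ANGLES))
    return rotate90(image, k), ANGLES[k]


def rotation_reward(response, target_angle):
    """Return 1.0 if the last angle mentioned in the response matches the target."""
    found = re.findall(r"\b(0|90|180|270)\b", response)
    return 1.0 if found and int(found[-1]) == target_angle else 0.0


image = [[1, 2], [3, 4]]
corrupted, angle = make_rotation_task(image, random.Random(0))
# In training, `corrupted` would be shown to the VLM; here the response is hard-coded.
reward = rotation_reward(f"The image is rotated by {angle} degrees.", angle)
```

The target is produced by the corruption step, so the reward can be checked automatically for every sample. A real pipeline would likely need stricter answer formatting than a regex over free text.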

If this is right

  • VLMs show improved performance on vision-centric and vision-language reasoning tasks after RL fine-tuning with SSL rewards.
  • No human preference data or external evaluators are required to generate the rewards.
  • The effectiveness of the approach depends on selecting SSL tasks with appropriate difficulty and alignment to the target domain.
  • The framework can be applied to other domains such as graph learning with similar benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could lower the barrier to applying RL in multimodal settings by removing dependency on costly data collection.
  • Researchers might investigate optimal combinations of different SSL tasks for reward design.
  • Scaling the method to even larger models could test whether the benefits persist or amplify.

Load-bearing premise

Self-supervised learning objectives can be reformulated into dense, automatic reward signals that align with and improve desired vision-language reasoning behaviors in VLMs.

What would settle it

A controlled experiment showing that fine-tuning a VLM with these SSL-derived rewards yields no significant improvement, or worse performance, on standard vision-language reasoning benchmarks relative to a baseline without such RL.

Figures

Figures reproduced from arXiv: 2510.16416 by Chenheng Zhang, Guojun Yin, Jiajun Chai, Qi Zhang, Runyu Zhou, Stefanie Jegelka, Wei Lin, Xiaohan Wang, Xiaojun Guo, Yifei Wang, Yisen Wang.

Figure 1: Overview of the SSL4RL framework. A corruption function transforms an input into …
Figure 2: Four SSL4RL tasks considered in our study.
Figure 3: Cross-Attention Heatmap Comparison. More instances are shown in Appendix L. Observations. Through a qualitative analysis of model responses, we observe two key improvements attributable to SSL4RL training. (1) Sharper Attention: The trained models exhibit more precise attention alignment with text queries. For instance, when queried about “hair” ( …
Figure 4: Test accuracy (%) on MMBench, varying the SSL task difficulty. The impact of task difficulty varies considerably across different SSL tasks, as evidenced by the results in Figure 4 (full results in Appendix C). The performances for the Contrastive task show a marked improvement upon increasing its difficulty, elevated from 69.27% to 77.89% on MMBench and from 61.90% to 65.00% on SEED-Bench. Conversely, t…
Figure 5: Test accuracy of SSL tasks. The preceding sections primarily investigate the effect of individual SSL rewards. A natural subsequent question is whether combining them during training can yield better performance compared to any single reward. To explore this, we train the Qwen2.5-VL-3B-Instruct model using a combination of all four SSL rewards. As illustrated in …
Figure 6: Perturbation examples on MMBench. …rating an additional multi-crop strategy. This strategy generates two 224×224 pixel crops per image via randomly resized cropping (scale range [0.08, 1.0]), presenting a more significant challenge to the model’s perception. Examples of these perturbed images are provided in …
Figure 7: Illustrations of Harder SSL4RL Task: Mask and Hard-Negative
Figure 8: Reward curves of SSL4RL models during reinforcement learning.
Figure 9: Rewards of SSL4RL 3B-models on MMBench, comparing different difficulties.
Figure 10: Entropy of SSL4RL models during reinforcement learning.
Figure 11: Response lengths of SSL4RL models during reinforcement learning.
Figure 12: Instances of VLMs’ loss on image information. After receiving textual instructions, …
Figure 13 illustrates that our models consistently display more focused attention towards the regions in the images corresponding to the selected token, which indirectly proves the superior performance of our models. Panels (a) to (e).
Original abstract

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors, such as task difficulty, model scale, and semantic alignment with the target domain, that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SSL4RL, a framework that reformulates self-supervised learning objectives such as image rotation prediction and masked patch reconstruction into dense, automatic reward signals for RL-based fine-tuning of vision-language models. The central claim is that this approach improves VLMs' utilization of visual evidence on vision-centric and vision-language reasoning benchmarks without human preferences or AI evaluators. Systematic ablations identify influencing factors including task difficulty, model scale, and semantic alignment, and the method is demonstrated on graph learning tasks.

Significance. If the results hold, SSL4RL provides a scalable, verifiable alternative to human or LLM-based rewards for multimodal alignment, with the ablations offering concrete design principles for selecting SSL tasks. The extension to graph learning demonstrates generality beyond VLMs. These elements strengthen the contribution if experimental attribution is clarified.

major comments (2)
  1. [Experiments] Experiments section: The reported benchmark improvements lack details on statistical significance testing, exact baseline implementations, data splits, and control conditions that isolate the SSL reward signal from generic RL training effects or additional optimization steps. This is load-bearing for the central claim that gains arise specifically from the reformulated intrinsic rewards.
  2. [Ablations] Ablations on semantic alignment and task difficulty: No control experiment decouples the SSL objective (e.g., rotation prediction succeeding on texture statistics) from the target VL reasoning distribution (e.g., object relations). Without this, it remains possible that observed gains reflect low-level visual regularization rather than improved high-level reasoning alignment.
minor comments (2)
  1. [Method] The method section would benefit from an explicit equation or pseudocode showing the precise mapping from SSL loss to per-step reward value.
  2. Figure captions for benchmark results should include error bars or run counts to aid interpretation of the reported gains.
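
The request for error bars and run counts can be illustrated with a short sketch that aggregates per-seed accuracies into a mean and a sample standard deviation. The helper name `summarize` is hypothetical, and the accuracy values are placeholders chosen for easy arithmetic; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder accuracies for three seeds, NOT results from the paper.
placeholder_runs = [70.0, 71.0, 69.0]
m, s = summarize(placeholder_runs)
print(f"{m:.2f} ± {s:.2f} (n={len(placeholder_runs)})")  # prints "70.00 ± 1.00 (n=3)"
```

Reporting the run count alongside the spread lets a reader judge whether a gain of a few points exceeds seed-to-seed variation.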

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the changes we will make to strengthen the experimental rigor and clarity of the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported benchmark improvements lack details on statistical significance testing, exact baseline implementations, data splits, and control conditions that isolate the SSL reward signal from generic RL training effects or additional optimization steps. This is load-bearing for the central claim that gains arise specifically from the reformulated intrinsic rewards.

    Authors: We agree that these details are essential for supporting the central claim. In the revised manuscript we will expand the Experiments section to report results over multiple random seeds with means and standard deviations for statistical significance. We will also provide precise descriptions of baseline implementations (including hyperparameter settings and training protocols), explicitly state the data splits used, and add control experiments that apply standard RL fine-tuning without the SSL-derived rewards. These controls will help isolate the contribution of the reformulated intrinsic rewards from generic optimization effects. revision: yes

  2. Referee: [Ablations] Ablations on semantic alignment and task difficulty: No control experiment decouples the SSL objective (e.g., rotation prediction succeeding on texture statistics) from the target VL reasoning distribution (e.g., object relations). Without this, it remains possible that observed gains reflect low-level visual regularization rather than improved high-level reasoning alignment.

    Authors: We acknowledge the value of a more targeted control to separate low-level visual statistics from semantic alignment with reasoning tasks. In the revision we will introduce an additional ablation that applies SSL objectives to inputs with disrupted high-level structure (e.g., texture-preserving but relation-disrupting transformations) and compare performance against the original semantically aligned tasks on the VL reasoning benchmarks. The results and discussion will be added to the Ablations section to clarify whether gains arise primarily from high-level alignment or from general visual regularization. revision: yes
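
The simulated rebuttal does not specify what a "texture-preserving but relation-disrupting" transformation would be. One possible reading, sketched here purely as an assumption, is a patch shuffle: local pixel content inside each tile is untouched while the spatial arrangement of tiles is randomized. The function name `shuffle_patches` and the nested-list image format are hypothetical.

```python
import random


def shuffle_patches(image, patch, rng):
    """Permute non-overlapping patch x patch tiles of a 2D list image."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = h // patch, w // patch
    # Cut the image into tiles in row-major order.
    tiles = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows)
        for c in range(cols)
    ]
    rng.shuffle(tiles)
    # Paste the shuffled tiles back onto an empty canvas.
    out = [[None] * w for _ in range(h)]
    for i, tile in enumerate(tiles):
        r, c = divmod(i, cols)
        for dy, tile_row in enumerate(tile):
            out[r * patch + dy][c * patch:(c + 1) * patch] = tile_row
    return out


image = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled = shuffle_patches(image, 2, random.Random(0))
```

Comparing SSL rewards computed on such inputs against rewards on intact images is one way the proposed control could be set up; other transformations would serve the same purpose.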

Circularity Check

0 steps flagged

No circularity: SSL rewards derived from independent standard objectives with empirical validation

full rationale

The paper introduces SSL4RL by reformulating established self-supervised tasks (rotation prediction, masked reconstruction) into RL reward signals for VLMs. These objectives pre-exist the target reasoning benchmarks and are not fitted or redefined in terms of the claimed performance gains. Experiments and ablations on benchmarks, task difficulty, and semantic alignment provide external validation rather than reducing the result to a self-referential fit or self-citation chain. No equations or derivations collapse the output to the input by construction, and the framework is presented as a practical paradigm with generality shown on graph learning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SSL objectives provide aligned and verifiable rewards suitable for guiding VLM reasoning via RL, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)
  • domain assumption: SSL tasks such as rotation prediction and masked patch reconstruction can be reformulated into dense automatic reward signals that align with target VLM reasoning behaviors
    This premise is invoked as the core mechanism enabling the SSL4RL framework.

pith-pipeline@v0.9.0 · 5792 in / 1204 out tokens · 45494 ms · 2026-05-21T20:20:27.238246+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  2. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  3. Visually-Guided Policy Optimization for Multimodal Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 3 Pith papers · 24 internal anchors
