SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Chenheng Zhang; Guojun Yin; Jiajun Chai; Qi Zhang; Runyu Zhou; Stefanie Jegelka; Wei Lin; Xiaohan Wang; Xiaojun Guo; Yifei Wang; Yisen Wang

arxiv: 2510.16416 · v4 · pith:URTZ6GV2new · submitted 2025-10-18 · 💻 cs.CV · cs.AI

Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

Pith reviewed 2026-05-21 20:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: self-supervised learning · reinforcement learning · vision-language models · intrinsic rewards · multimodal reasoning · fine-tuning

The pith

Self-supervised learning objectives can be reformulated as automatic reward signals for reinforcing vision-language model reasoning without external evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework called SSL4RL that uses self-supervised learning tasks to generate intrinsic rewards for reinforcement learning applied to vision-language models. This addresses the problem of unreliable or expensive reward mechanisms by turning tasks like predicting image rotations or reconstructing masked patches into dense, verifiable reward signals. Experiments demonstrate that this leads to substantial performance gains on both vision-centric tasks and more complex vision-language reasoning benchmarks. Ablations show that factors like task difficulty, model scale, and semantic alignment play important roles in determining success. The method also generalizes to improving graph learning performance.

Core claim

SSL4RL reformulates common self-supervised learning objectives into dense automatic reward signals that can be used for RL-based fine-tuning of vision-language models, eliminating the need for human preference data or AI evaluators and leading to improved results on reasoning benchmarks.

What carries the argument

Reformulation of SSL objectives such as image rotation prediction and masked patch reconstruction into dense, automatic reward signals for RL fine-tuning of VLMs.
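
To make the reformulation concrete, here is a minimal Python sketch of how a rotation-prediction task could be scored as a verifiable reward. It follows the indicator form r = 1[ŷ = y] quoted from the paper further down this page. The four-angle label set, the helper names (`make_rotation_task`, `parse_angle`, `rotation_reward`), and the text-parsing rule are assumptions made for illustration and are not taken from the paper.

```python
import random

# Assumed label set; the paper's exact rotation choices may differ.
ROTATIONS = (0, 90, 180, 270)

def make_rotation_task(rng):
    """Sample the ground-truth angle that would be applied to an image."""
    return rng.choice(ROTATIONS)

def parse_angle(response):
    """Return the last integer found in the model's text response, or None."""
    tokens = [t.strip(".,°") for t in response.split()]
    numbers = [int(t) for t in tokens if t.isdecimal()]
    return numbers[-1] if numbers else None

def rotation_reward(response, true_angle):
    """Binary verifiable reward: 1.0 if the predicted angle matches, else 0.0."""
    return 1.0 if parse_angle(response) == true_angle else 0.0
```

Because the label is produced by the transformation itself, the reward can be checked automatically, which is the property the page describes as needing no human preference data or AI evaluator.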

If this is right

VLMs show improved performance on vision-centric and vision-language reasoning tasks after RL fine-tuning with SSL rewards.
No human preference data or external evaluators are required to generate the rewards.
The effectiveness of the approach depends on selecting SSL tasks with appropriate difficulty and alignment to the target domain.
The framework can be applied to other domains such as graph learning with similar benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could lower the barrier to applying RL in multimodal settings by removing dependency on costly data collection.
Researchers might investigate optimal combinations of different SSL tasks for reward design.
Scaling the method to even larger models could test whether the benefits persist or amplify.

Load-bearing premise

Self-supervised learning objectives can be reformulated into dense, automatic reward signals that align with and improve desired vision-language reasoning behaviors in VLMs.

What would settle it

A controlled experiment showing that fine-tuning a VLM using these SSL-derived rewards produces no significant improvement or even worse performance on standard vision-language reasoning benchmarks compared to a baseline without such RL.

Figures

Figures reproduced from arXiv: 2510.16416 by Chenheng Zhang, Guojun Yin, Jiajun Chai, Qi Zhang, Runyu Zhou, Stefanie Jegelka, Wei Lin, Xiaohan Wang, Xiaojun Guo, Yifei Wang, Yisen Wang.

**Figure 2.** Four SSL4RL tasks considered in our study.

**Figure 3.** Cross-Attention Heatmap Comparison. More instances are shown in Appendix L. Observations. Through a qualitative analysis of model responses, we observe two key improvements attributable to SSL4RL training. (1) Sharper Attention: The trained models exhibit more precise attention alignment with text queries. For instance, when queried about “hair” (…

**Figure 4.** Test accuracy (%) on MMBench, varying the SSL task difficulty. The impact of task difficulty varies considerably across different SSL tasks, as evidenced by the results in Figure 4 (full results in Appendix C). The performance on the Contrastive task improves markedly as its difficulty increases, rising from 69.27% to 77.89% on MMBench and from 61.90% to 65.00% on SEED-Bench. Conversely, t…

**Figure 5.** Test accuracy of SSL tasks. The preceding sections primarily investigate the effect of individual SSL rewards. A natural subsequent question is whether combining them during training can yield better performance compared to any single reward. To explore this, we train the Qwen2.5-VL-3B-Instruct model using a combination of all four SSL rewards. As illustrated in …

**Figure 6.** Perturbation examples on MMBench. …rating an additional multi-crop strategy. This strategy generates two 224×224 pixel crops per image via randomly resized cropping (scale range [0.08, 1.0]), presenting a more significant challenge to the model’s perception. Examples of these perturbed images are provided in …

**Figure 7.** Illustrations of Harder SSL4RL Task: Mask and Hard-Negative.

**Figure 8.** Reward curves of SSL4RL models during reinforcement learning.

**Figure 9.** Rewards of SSL4RL 3B-models on MMBench, comparing different difficulties.

**Figure 10.** Entropy of SSL4RL models during reinforcement learning.

**Figure 11.** Response lengths of SSL4RL models during reinforcement learning.

**Figure 12.** Instances of VLMs’ loss on image information. After receiving textual instructions, …

**Figure 13.** Panels (a) to (e). … illustrate that our models consistently display more focused attention towards the regions in the images corresponding to the selected token, which indirectly proves the superior performance of our models.

Original abstract

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors, such as task difficulty, model scale, and semantic alignment with the target domain, that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SSL4RL turns common SSL tasks into RL rewards for VLMs and reports benchmark gains, but the abstract leaves the mechanism and controls too vague to judge if the gains truly come from better reasoning alignment.

The main thing to know is that this paper takes standard self-supervised tasks like rotation prediction or masked reconstruction and converts them into dense, automatic rewards for RL fine-tuning of vision-language models. The goal is to push VLMs to use visual evidence more reliably instead of falling back on language priors or shortcuts. They report improvements on vision-centric and vision-language reasoning benchmarks plus some ablations on task difficulty and semantic alignment, and they show the idea also works on graphs.

Referee Report

2 major / 2 minor

Summary. The paper proposes SSL4RL, a framework that reformulates self-supervised learning objectives such as image rotation prediction and masked patch reconstruction into dense, automatic reward signals for RL-based fine-tuning of vision-language models. The central claim is that this approach improves VLMs' utilization of visual evidence on vision-centric and vision-language reasoning benchmarks without human preferences or AI evaluators. Systematic ablations identify influencing factors including task difficulty, model scale, and semantic alignment, and the method is demonstrated on graph learning tasks.

Significance. If the results hold, SSL4RL provides a scalable, verifiable alternative to human or LLM-based rewards for multimodal alignment, with the ablations offering concrete design principles for selecting SSL tasks. The extension to graph learning demonstrates generality beyond VLMs. These elements strengthen the contribution if experimental attribution is clarified.

major comments (2)

[Experiments] Experiments section: The reported benchmark improvements lack details on statistical significance testing, exact baseline implementations, data splits, and control conditions that isolate the SSL reward signal from generic RL training effects or additional optimization steps. This is load-bearing for the central claim that gains arise specifically from the reformulated intrinsic rewards.
[Ablations] Ablations on semantic alignment and task difficulty: No control experiment decouples the SSL objective (e.g., rotation prediction succeeding on texture statistics) from the target VL reasoning distribution (e.g., object relations). Without this, it remains possible that observed gains reflect low-level visual regularization rather than improved high-level reasoning alignment.

minor comments (2)

[Method] The method section would benefit from an explicit equation or pseudocode showing the precise mapping from SSL loss to per-step reward value.
Figure captions for benchmark results should include error bars or run counts to aid interpretation of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the changes we will make to strengthen the experimental rigor and clarity of the manuscript.

Referee: [Experiments] Experiments section: The reported benchmark improvements lack details on statistical significance testing, exact baseline implementations, data splits, and control conditions that isolate the SSL reward signal from generic RL training effects or additional optimization steps. This is load-bearing for the central claim that gains arise specifically from the reformulated intrinsic rewards.

Authors: We agree that these details are essential for supporting the central claim. In the revised manuscript we will expand the Experiments section to report results over multiple random seeds, with means and standard deviations, so that the statistical significance of the gains can be assessed. We will also provide precise descriptions of baseline implementations (including hyperparameter settings and training protocols), explicitly state the data splits used, and add control experiments that apply standard RL fine-tuning without the SSL-derived rewards. These controls will help isolate the contribution of the reformulated intrinsic rewards from generic optimization effects.

Revision: yes

Referee: [Ablations] Ablations on semantic alignment and task difficulty: No control experiment decouples the SSL objective (e.g., rotation prediction succeeding on texture statistics) from the target VL reasoning distribution (e.g., object relations). Without this, it remains possible that observed gains reflect low-level visual regularization rather than improved high-level reasoning alignment.

Authors: We acknowledge the value of a more targeted control to separate low-level visual statistics from semantic alignment with reasoning tasks. In the revision we will introduce an additional ablation that applies SSL objectives to inputs with disrupted high-level structure (e.g., texture-preserving but relation-disrupting transformations) and compare performance against the original semantically aligned tasks on the VL reasoning benchmarks. The results and discussion will be added to the Ablations section to clarify whether gains arise primarily from high-level alignment or from general visual regularization.

Revision: yes
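
The multi-seed reporting and the no-SSL-reward control discussed in the first response could be summarized along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, accuracies are assumed to be one number per random seed, and the sample standard deviation is one reasonable choice of spread measure rather than the authors' stated protocol.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(accuracies), stdev(accuracies)

def gain_over_control(ssl_runs, control_runs):
    """Mean accuracy gain of SSL-reward runs over a control arm.

    The control arm is assumed to be RL fine-tuning without SSL-derived rewards.
    """
    return mean(ssl_runs) - mean(control_runs)
```

A gain that is small relative to the per-arm standard deviations would leave the attribution question raised by the referee open.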

Circularity Check

0 steps flagged

No circularity: SSL rewards derived from independent standard objectives with empirical validation

The paper introduces SSL4RL by reformulating established self-supervised tasks (rotation prediction, masked reconstruction) into RL reward signals for VLMs. These objectives pre-exist the target reasoning benchmarks and are not fitted or redefined in terms of the claimed performance gains. Experiments and ablations on benchmarks, task difficulty, and semantic alignment provide external validation rather than reducing the result to a self-referential fit or self-citation chain. No equations or derivations collapse the output to the input by construction, and the framework is presented as a practical paradigm with generality shown on graph learning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SSL objectives provide aligned and verifiable rewards suitable for guiding VLM reasoning via RL, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (1)

Domain assumption: SSL tasks such as rotation prediction and masked patch reconstruction can be reformulated into dense automatic reward signals that align with target VLM reasoning behaviors.
This premise is invoked as the core mechanism enabling the SSL4RL framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals... r = 1[ŷ = y]

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We adopt Grouped Reinforcement Policy Optimization (GRPO)... J(θ) = E[R(τ) − β KL...]
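
For readers unfamiliar with GRPO, the step that turns per-response rewards into a learning signal can be sketched as below. GRPO, as introduced in the DeepSeekMath paper, samples a group of responses per prompt and normalizes each reward against the group's mean and standard deviation. The sketch assumes binary SSL rewards, uses the population standard deviation, and adds a small `eps` for numerical safety; those choices and the function name are illustrative assumptions, not details confirmed by this page.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group.

    rewards: rewards for several responses to the same prompt.
    Returns (r - group mean) / (group std + eps) for each reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one plausible reason SSL task difficulty matters for training.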

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
cs.CV · 2026-04 · unverdicted · novelty 6.0
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

Boosting Visual Instruction Tuning with Self-Supervised Guidance
cs.CV · 2026-04 · unverdicted · novelty 6.0
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

Visually-Guided Policy Optimization for Multimodal Reasoning
cs.CV · 2026-04 · unverdicted · novelty 6.0
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 3 Pith papers · 24 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jason Kernion, Jackson Jones, Andy Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Con- stitutional ai: Harmlessness from ai feedback.arXiv:2212.08073, 2022. URLhttps: //arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV,

work page
[4]

URLhttps://arxiv.org/abs/2104.14294

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Masked thought: Simply masking partial reasoning steps can im- prove mathematical reasoning learning of language models

Changyu Chen, Xiting Wang, Ting-En Lin, Ang Lv, Yuchuan Wu, Xin Gao, Ji-Rong Wen, Rui Yan, and Yongbin Li. Masked thought: Simply masking partial reasoning steps can im- prove mathematical reasoning learning of language models. InACL, 2024. URLhttps: //arxiv.org/abs/2403.02178

work page arXiv 2024
[6]

Humans or llms as the judge? a study on judgement bias

Guanting Chen et al. Humans or llms as the judge? a study on judgement bias. InEMNLP,

work page
[7]

URLhttps://aclanthology.org/2024.emnlp-main.474.pdf

work page 2024
[8]

Mixed au- toencoder for self-supervised visual representation learning

Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, and Dit-Yan Yeung. Mixed au- toencoder for self-supervised visual representation learning. InCVPR, 2023

work page 2023
[9]

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. URLhttps://arxiv. org/abs/2002.05709

work page internal anchor Pith review Pith/arXiv arXiv 2020
[10]

Context autoencoder for self-supervised representation learning.International Journal of Computer Vision, 132(1):208–223, 2024

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning.International Journal of Computer Vision, 132(1):208–223, 2024

work page 2024
[11]

Caparena: Benchmarking and analyzing detailed image captioning in the llm era

Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, and Jiajun Chen. Caparena: Benchmarking and analyzing detailed image captioning in the llm era. InACL Findings, 2025

work page 2025
[12]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009

work page 2009
[14]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNAACL, 2019. URLhttps: //arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2019
[15]

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learn- ing by context prediction. InICCV, 2015. URLhttps://arxiv.org/abs/1505. 05192

work page 2015
[16]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InACM MM, 2024

work page 2024
[17]

Taglas: An atlas of text-attributed graph datasets in the era of large graph and language models.arXiv preprint arXiv:2406.14683, 2024

Jiarui Feng, Hao Liu, Lecheng Kong, Mingfang Zhu, Yixin Chen, and Muhan Zhang. Taglas: An atlas of text-attributed graph datasets in the era of large graph and language models.arXiv preprint arXiv:2406.14683, 2024

work page arXiv 2024
[18]

Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008, 2025

Stephanie Fu, Tyler Bonnen, Devin Guillory, and Trevor Darrell. Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008, 2025

work page arXiv 2025
[19]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InECCV, 2024

work page 2024
[20]

Robust con- trastive learning using negative samples with diminished semantics

Songwei Ge, Shlok Mishra, Chun-Liang Li, Haohan Wang, and David Jacobs. Robust con- trastive learning using negative samples with diminished semantics. InNeurIPS, 2021

work page 2021
[21]

Unsupervised representation learning by predicting image rotations

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. InICLR, 2018. URLhttps://arxiv.org/abs/1803. 07728

work page 2018
[22]

Bootstrap your own latent: A new approach to self-supervised learn- ing

Jean-Bastien Grill et al. Bootstrap your own latent: A new approach to self-supervised learn- ing. InNeurIPS, 2020. URLhttps://arxiv.org/abs/2006.07733

work page arXiv 2020
[23]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InCVPR, 2020. URLhttps://arxiv.org/ abs/1911.05722

work page arXiv 2020
[24]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022. URLhttps://arxiv.org/ abs/2111.06377

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Graphmae: Self-supervised masked graph autoencoders

Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. Graphmae: Self-supervised masked graph autoencoders. InKDD, 2022. URLhttps: //arxiv.org/abs/2205.10803

work page arXiv 2022
[27]

arXiv preprint arXiv:1905.12265 , year=

Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. InICLR, 2020. URLhttps: //arxiv.org/abs/1905.12265

work page arXiv 2020
[28]

Visual robustness benchmark for visual ques- tion answering (vqa)

Md Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Md Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, and Md Azam Hossain. Visual robustness benchmark for visual ques- tion answering (vqa). InWACV, 2025

work page 2025
[29]

Look again, think slowly: Enhancing visual reflection in vision-language models

Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, and Jiajun Zhang. Look again, think slowly: Enhancing visual reflection in vision-language models. InEMNLP, 2025. 13

work page 2025
[30]

Self-supervised learning on graphs: Deep insights and new direction.arXiv preprint arXiv:2006.10141, 2020

Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. Self-supervised learning on graphs: Deep insights and new direction.arXiv preprint arXiv:2006.10141, 2020

work page arXiv 2006
[31]

Hard negative mixing for contrastive learning

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. InNeurIPS, 2020

work page 2020
[32]

Variational Graph Auto-Encoders

Thomas N Kipf and Max Welling. Variational graph auto-encoders.arXiv preprint arXiv:1611.07308, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

Gotmare, Silvio Savarese, and Steven C.H

Hung Le, Yue Wang, Akhilesh D. Gotmare, Silvio Savarese, and Steven C.H. Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. In NeurIPS, 2022. URLhttps://arxiv.org/abs/2207.01780

work page arXiv 2022
[34]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. InACL, 2020. URLhttps://arxiv.org/abs/1910.13461

work page internal anchor Pith review Pith/arXiv arXiv 2020
[35]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Selvaraju, Akhilesh D

Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. InNeurIPS, 2021. URLhttps://arxiv.org/abs/2107.07651

work page arXiv 2021
[38]

URLhttps://arxiv.org/abs/2505.07185

work page arXiv
[39]

Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, 2024

work page 2024
[40]

Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self- supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

work page arXiv 2025
[41]

Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. InECCV, 2016. URLhttps://arxiv.org/abs/1603.09246

work page internal anchor Pith review Pith/arXiv arXiv 2016
[42]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khali- dov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML,

work page
[45]

URLhttps://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS, 2023. URLhttps://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.JMLR, 2020. URLhttps://arxiv.org/abs/1910.10683. 14

work page internal anchor Pith review Pith/arXiv arXiv 2020
[48]

Is llm-as-a-judge robust? investigating universal adversarial attacks on zero- shot llm assessment

Vasu Raina et al. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero- shot llm assessment. InEMNLP, 2024. URLhttps://aclanthology.org/2024. emnlp-main.427.pdf

work page 2024
[49]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[50] Zhoujun Shao et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models via grpo. arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

[51] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

[52] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023. URL https://arxiv.org/abs/2303.11366

[54] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, ...

[55] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024.

[56] Xuezhi Wang et al. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023. URL https://arxiv.org/abs/2203.11171

[57] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In ACL, 2023. URL https://arxiv.org/abs/2212.10560

[58] Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, and Matthew B. Blaschko. Jigsaw-r1: A study of rule-based visual reinforcement learning with jigsaw puzzles, 2025.

[59] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024.

[60] xAI. Grok-1.5 vision preview. April 2024. URL https://x.ai/news/grok-1.5v. Accessed: [Today’s Date]

[61] Hong Xuan, Abby Stylianou, Xiaotong Liu, and Robert Pless. Hard negative examples are hard, but useful. In ECCV, 2020.

[62] Yonglong You, Tianlong Chen, Zhangyang Sui, and Yang Wang. Graph contrastive learning with augmentations. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf

[63] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In NeurIPS, 2022. URL https://arxiv.org/abs/2203.14465

[64] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.

The Use of Large Language Models (LLMs)

In this work, LLMs ...

3,1,9,2,8,5,4,6,7

Brown fur covering most of its body: … the overall appearance suggests it had brown fur. 3. Long legs: The fossil shows long legs... Given these observations, the trait that can be clearly observed on the fossil is the rounded ears. Response: To determine which trait Ursus spelaeus had based on the fossil, let's analyze the given options: A. Rounded ears ...
