pith. machine review for the scientific record.

arxiv: 2604.12966 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual instruction tuning · self-supervised learning · multimodal large language models · vision-centric tasks · pretext tasks · visual reasoning

The pith

Reformulating self-supervised tasks as instructions improves vision-centric performance in multimodal models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often underuse visual information during instruction tuning because many tasks can be solved with language priors alone. The paper proposes converting classical self-supervised pretext tasks such as rotation prediction, color matching, and cross-view correspondence into natural language image-instruction-response triplets. Adding only 3-10% of these visually grounded examples to the training data consistently raises scores on vision-centric benchmarks. The method requires no human labels, no architecture changes, and no extra training stages, working across models and regimes by shifting the data distribution.
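To make the data-level mechanism concrete, here is a minimal Python sketch of how a rotation-prediction pretext task could be cast as an image-instruction-response triplet and mixed into an instruction-tuning set at a small ratio. The prompt wording, angle set, and helper names (ROTATION_PROMPT, make_rotation_triplet, inject_ssl_instructions, rho) are illustrative assumptions, not the authors' templates; the released V-GIFT code is the authoritative implementation.

```python
import random
from PIL import Image

# Hypothetical prompt template; the authors' actual instructions live in the V-GIFT repo.
ROTATION_PROMPT = (
    "The image may have been rotated. By how many degrees clockwise "
    "was it rotated: 0, 90, 180, or 270?"
)

def make_rotation_triplet(image: Image.Image) -> dict:
    """Reformulate rotation prediction as an image-instruction-response triplet."""
    angle = random.choice([0, 90, 180, 270])
    # PIL rotates counter-clockwise, so negate to obtain a clockwise rotation.
    rotated = image.rotate(-angle, expand=True)
    return {"image": rotated, "instruction": ROTATION_PROMPT, "response": str(angle)}

def inject_ssl_instructions(base_data: list, images: list, rho: float = 0.05) -> list:
    """Mix SSL-derived triplets into the instruction-tuning set at ratio rho (3-10% in the paper)."""
    n_ssl = int(rho * len(base_data))
    ssl_data = [make_rotation_triplet(random.choice(images)) for _ in range(n_ssl)]
    mixed = base_data + ssl_data
    random.shuffle(mixed)
    return mixed
```

The same pattern would apply to the other pretext tasks (point-wise colorization, cross-view correspondence), each with its own prompt template and answer construction.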

Core claim

Reformulating self-supervised pretext tasks as image-instruction-response triplets that cannot be solved without visual evidence, and injecting a small fraction of these instructions during visual instruction tuning, yields consistent gains on vision-centric evaluations across multiple models, training regimes, and benchmarks.

What carries the argument

Reformulation of classical self-supervised pretext tasks into image-instruction-response triplets that force reliance on visual input rather than language priors.

If this is right

  • Vision-centric benchmark scores rise without any model architecture or training procedure changes.
  • The gains appear across different multimodal models and instruction-tuning regimes.
  • Only a small fraction of the overall training data needs to consist of the visually grounded instructions.
  • Adjusting the distribution of instruction data is sufficient to improve visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reformulation trick could be tested on other input modalities where models lean on priors.
  • This points to data composition as a higher-leverage knob than scaling model size for visual tasks.
  • Extending the set of pretext tasks to include additional visual properties would test the generality of the approach.

Load-bearing premise

The reformulated self-supervised tasks cannot be solved using language priors alone and therefore compel the model to utilize visual evidence.

What would settle it

If models achieve the same performance gains when the self-supervised instructions are replaced by non-visual text-only equivalents, the claim that visual grounding drives the improvement would be falsified.
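One way to operationalize that test is a vision-ablated probe: present the SSL instructions with the image replaced by a blank canvas and compare accuracy to chance. The sketch below is an assumption-laden illustration, not from the paper; `model.answer(image, instruction)` is a hypothetical stand-in for whatever generation API a given MLLM exposes, and the 336×336 canvas size is an arbitrary default.

```python
from PIL import Image

def vision_ablated_accuracy(model, ssl_triplets, blank_size=(336, 336)) -> float:
    """Probe whether SSL instructions are solvable without visual evidence.

    Replaces each image with a uniform gray canvas; accuracy near chance
    supports the load-bearing premise that the tasks require visual
    grounding, while accuracy well above chance would undercut it.
    """
    blank = Image.new("RGB", blank_size, color=(127, 127, 127))
    correct = 0
    for ex in ssl_triplets:
        # `model.answer` is a placeholder for the MLLM's generate call.
        prediction = model.answer(image=blank, instruction=ex["instruction"])
        correct += int(prediction.strip() == ex["response"].strip())
    return correct / max(len(ssl_triplets), 1)
```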

Figures

Figures reproduced from arXiv: 2604.12966 by Andrei Bursuc, Monika Wysoczanska, Nicolas Thome, Sophia Sirko-Galouchenko, Spyros Gidaris.

Figure 1
Figure 1: Visually Grounded Instruction Fine-Tuning (V-GIFT). We enhance visual instruction tuning by injecting visually grounded self-supervised tasks as additional instruction-following examples sampled from the instruction-tuning data (left; rotation prediction shown). This simple modification encourages better use of visual information and yields consistent gains on vision-centric benchmarks (right; CVB-2D, POPE… view at source ↗
Figure 2
Figure 2: Visually grounded instruction-following tasks reformulated from self-supervised learning (SSL) pretext tasks. (a) Rotation prediction: the model must recognize object orientations and relate them to canonical poses. (b) Point-wise colorization: the model must match grayscale points to their original colors, requiring fine-grained visual discrimination, spatial grounding, and reasoning over local and global … view at source ↗
Figure 3
Figure 3: Effect of the SSL injection ratio ρ on vision-centric instruction-following performance for LLaVA-1.5-Qwen2.5-7B (left) and LLaVA-OneVision-1.5 (right). view at source ↗
Figure 4
Figure 4: Attention map from the Baseline (LLaVA-1.5-Vicuna-7B) and V-GIFT on CV-Bench2D examples. V-GIFT produces more focused and better localized attention on task-relevant objects. Q: Is the camera moving left or right? Baseline: right V-GIFT: left Q: Is the cat beneath the car? Baseline: Yes V-GIFT: No Q: Which point functionally corresponds to REF? Baseline: Point B V-GIFT: Point C Q: How many people are weari… view at source ↗
Figure 5
Figure 5: Qualitative examples. We present a few qualitative examples comparing the LLaVA-1.5 Qwen-2.5-7B baseline against V-GIFT. Our SSL-inspired tasks yield improvements on a variety of vision-oriented skills such as counting, multi-view reasoning and visual reasoning, comparing the baseline LLaVA-1.5 Vicuna 7B trained with the standard Instruction Tuning dataset and the model trained with V-GIFT. We observe that model t… view at source ↗
Figure 6
Figure 6: Examples of the visually grounded self-supervised tasks used during training: colorization point matching, point correspondence, and rotation prediction. view at source ↗
read the original abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that augmenting visual instruction tuning with 3-10% reformulated self-supervised pretext tasks (rotation prediction, color matching, cross-view correspondence) expressed as image-instruction-response triplets improves MLLM performance on vision-centric benchmarks. These tasks are asserted to supply supervision that cannot be solved using language priors alone, thereby compelling greater utilization of visual features during tuning. The method requires no annotations, architectural changes, or extra stages, and yields consistent gains across models, regimes, and benchmarks. Code is released at https://github.com/sirkosophia/V-GIFT.

Significance. If the reported gains are robust and specifically attributable to compelled visual grounding rather than data volume or diversity effects, the work provides a lightweight, annotation-free lever for improving visual reasoning in MLLMs. This could influence data curation practices for instruction tuning. The open-source code is a clear strength that supports reproducibility and extension.

major comments (3)
  1. [Abstract, §3] The load-bearing assertion that the reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation evaluates whether a language-only model or a vision-ablated input can solve the tasks above chance (e.g., via common object-color associations for color matching or orientation statistics for rotation). Without this, gains cannot be confidently attributed to visual utilization rather than generic instruction data effects.
  2. [§4, Table 2] Performance tables show improvements on vision-centric evaluations, but lack controls that inject equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate the contribution of the visual-grounding mechanism. The 3-10% fraction is presented as key, yet no scaling or volume-matched baseline is reported.
  3. [§4.3] While multiple models and benchmarks are evaluated, the manuscript provides no statistical tests, run-to-run variance, or confidence intervals. This weakens the claim of 'consistent' improvements, especially given the small data fraction and potential sensitivity to training hyperparameters.
minor comments (2)
  1. [§2] Related work on SSL in vision-language models is cited, but the discussion of how the proposed reformulation differs from prior uses of pretext tasks in MLLM training could be expanded for clarity.
  2. [Figure 1] The diagram illustrating the data augmentation pipeline is helpful, but the caption should explicitly note the exact percentage of SSL samples used in the illustrated example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions and strengthen the evidence for our claims. We address each major point below and commit to revisions that directly respond to the concerns while preserving the core findings.

read point-by-point responses
  1. Referee: [Abstract, §3] The assertion that reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation with language-only model or vision-ablated input to check if solvable above chance via priors.

    Authors: We agree this explicit test would provide stronger attribution. The tasks were selected precisely because classical SSL literature shows they depend on visual properties (e.g., rotation requires image orientation; cross-view correspondence requires spatial alignment not deducible from text). In the revised manuscript we will add a controlled ablation: (i) a text-only LLM baseline on the same instruction triplets and (ii) a vision-ablated MLLM variant, demonstrating near-chance performance and thereby confirming the visual-grounding requirement. revision: yes

  2. Referee: [§4, Table 2] Performance tables lack controls injecting equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate visual-grounding mechanism; no volume-matched or scaling baseline for the 3-10% fraction.

    Authors: We acknowledge that a direct volume-matched control would better isolate the mechanism. Our current setup keeps the base instruction data fixed and adds only the SSL fraction, so gains are measured atop identical data volume. In revision we will add a control experiment replacing the SSL triplets with an equal number of randomly sampled or language-prior-heavy instructions drawn from existing VQA-style data, showing that these do not produce comparable gains on vision-centric benchmarks. revision: yes

  3. Referee: [§4.3] No statistical tests, run-to-run variance, or confidence intervals, weakening the 'consistent' claim given small data fraction and hyperparameter sensitivity.

    Authors: We recognize the importance of statistical reporting. Experiments used fixed hyperparameters across models for fairness and showed gains on five distinct MLLMs and multiple benchmarks. Due to compute limits we did not run full multi-seed sweeps for every configuration. In the revised version we will report results from at least three independent runs for the primary settings, include standard deviations, and add a brief discussion of variance. revision: partial
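Response 2 commits to a volume-matched control. A minimal sketch of how such matched mixtures could be assembled appears below; the pool names and sampling scheme are assumptions for illustration, not the authors' protocol.

```python
import random

def build_matched_mixtures(base_data: list, ssl_pool: list, vqa_pool: list, n_extra: int):
    """Build two training mixtures of identical size: one adds SSL triplets,
    the other adds the same number of ordinary VQA-style instructions.

    Training one model on each and comparing vision-centric scores would
    separate the visually grounded content from a generic more-data effect.
    """
    assert len(ssl_pool) >= n_extra and len(vqa_pool) >= n_extra
    with_ssl = base_data + random.sample(ssl_pool, n_extra)
    with_control = base_data + random.sample(vqa_pool, n_extra)
    random.shuffle(with_ssl)
    random.shuffle(with_control)
    return with_ssl, with_control
```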

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper presents an empirical data-augmentation technique: reformulating pretext tasks (rotation, color matching, cross-view correspondence) as instruction triplets and mixing 3-10% into visual instruction tuning. Performance gains are measured on held-out vision-centric benchmarks across multiple models and regimes. No equations, no fitted parameters renamed as predictions, and no load-bearing self-citation steps appear. The assertion that the tasks 'cannot be solved without visual evidence' is an unproven modeling assumption rather than a derivation that reduces to its own inputs; the reported improvements are externally falsifiable and do not rely on internal self-consistency loops. This is a standard, non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard domain assumptions in machine learning about the value of self-supervision and data mixing, without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption Self-supervised visual tasks reformulated as instructions cannot be solved without visual evidence
    Central premise invoked to justify the method's effectiveness.

pith-pipeline@v0.9.0 · 5522 in / 1098 out tokens · 61560 ms · 2026-05-10T16:22:51.682657+00:00 · methodology

discussion (0)

