pith. machine review for the scientific record.

arxiv: 2604.12966 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual instruction tuning · self-supervised learning · multimodal large language models · vision-centric tasks · pretext tasks · visual reasoning

The pith

Reformulating self-supervised tasks as instructions improves vision-centric performance in multimodal models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often underuse visual information during instruction tuning because many tasks can be solved with language priors alone. The paper proposes converting classical self-supervised pretext tasks such as rotation prediction, color matching, and cross-view correspondence into natural language image-instruction-response triplets. Adding only 3-10% of these visually grounded examples to the training data consistently raises scores on vision-centric benchmarks. The method requires no human labels, no architecture changes, and no extra training stages, working across models and regimes by shifting the data distribution.
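To make the data-level mechanism concrete, here is a minimal Python sketch of how a rotation-prediction pretext task could be cast as an image-instruction-response triplet and mixed into an instruction-tuning set at a small ratio. The prompt wording, angle set, and helper names (ROTATION_PROMPT, make_rotation_triplet, inject_ssl_instructions, rho) are illustrative assumptions, not the authors' templates; the released V-GIFT code is the authoritative implementation.

```python
import random
from PIL import Image

# Hypothetical prompt template; the authors' actual instructions live in the V-GIFT repo.
ROTATION_PROMPT = (
    "The image may have been rotated. By how many degrees clockwise "
    "was it rotated: 0, 90, 180, or 270?"
)

def make_rotation_triplet(image: Image.Image) -> dict:
    """Reformulate rotation prediction as an image-instruction-response triplet."""
    angle = random.choice([0, 90, 180, 270])
    # PIL rotates counter-clockwise, so negate to obtain a clockwise rotation.
    rotated = image.rotate(-angle, expand=True)
    return {"image": rotated, "instruction": ROTATION_PROMPT, "response": str(angle)}

def inject_ssl_instructions(base_data: list, images: list, rho: float = 0.05) -> list:
    """Mix SSL-derived triplets into the instruction-tuning set at ratio rho (3-10% in the paper)."""
    n_ssl = int(rho * len(base_data))
    ssl_data = [make_rotation_triplet(random.choice(images)) for _ in range(n_ssl)]
    mixed = base_data + ssl_data
    random.shuffle(mixed)
    return mixed
```

The same pattern would apply to the other pretext tasks (point-wise colorization, cross-view correspondence), each with its own prompt template and answer construction.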

Core claim

Reformulating self-supervised pretext tasks as image-instruction-response triplets that cannot be solved without visual evidence, and injecting a small fraction of these instructions during visual instruction tuning, yields consistent gains on vision-centric evaluations across multiple models, training regimes, and benchmarks.

What carries the argument

Reformulation of classical self-supervised pretext tasks into image-instruction-response triplets that force reliance on visual input rather than language priors.

If this is right

  • Vision-centric benchmark scores rise without any model architecture or training procedure changes.
  • The gains appear across different multimodal models and instruction-tuning regimes.
  • Only a small fraction of the overall training data needs to consist of the visually grounded instructions.
  • Adjusting the distribution of instruction data is sufficient to improve visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reformulation trick could be tested on other input modalities where models lean on priors.
  • This points to data composition as a higher-leverage knob than scaling model size for visual tasks.
  • Extending the set of pretext tasks to include additional visual properties would test the generality of the approach.

Load-bearing premise

The reformulated self-supervised tasks cannot be solved using language priors alone and therefore compel the model to utilize visual evidence.

What would settle it

If models achieve the same performance gains when the self-supervised instructions are replaced by non-visual text-only equivalents, the claim that visual grounding drives the improvement would be falsified.
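One way to operationalize that test is a vision-ablated probe: present the SSL instructions with the image replaced by a blank canvas and compare accuracy to chance. The sketch below is an assumption-laden illustration, not from the paper; `model.answer(image, instruction)` is a hypothetical stand-in for whatever generation API a given MLLM exposes, and the 336×336 canvas size is an arbitrary default.

```python
from PIL import Image

def vision_ablated_accuracy(model, ssl_triplets, blank_size=(336, 336)) -> float:
    """Probe whether SSL instructions are solvable without visual evidence.

    Replaces each image with a uniform gray canvas; accuracy near chance
    supports the load-bearing premise that the tasks require visual
    grounding, while accuracy well above chance would undercut it.
    """
    blank = Image.new("RGB", blank_size, color=(127, 127, 127))
    correct = 0
    for ex in ssl_triplets:
        # `model.answer` is a placeholder for the MLLM's generate call.
        prediction = model.answer(image=blank, instruction=ex["instruction"])
        correct += int(prediction.strip() == ex["response"].strip())
    return correct / max(len(ssl_triplets), 1)
```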

Figures

Figures reproduced from arXiv: 2604.12966 by Andrei Bursuc, Monika Wysoczanska, Nicolas Thome, Sophia Sirko-Galouchenko, Spyros Gidaris.

Figure 1
Figure 1: Visually Grounded Instruction Fine-Tuning (V-GIFT). We enhance visual instruction tuning by injecting visually grounded self-supervised tasks as additional instruction-following examples sampled from the instruction-tuning data (left; rotation prediction shown). This simple modification encourages better use of visual information and yields consistent gains on vision-centric benchmarks (right; CVB-2D, POPE… view at source ↗
Figure 2
Figure 2: Visually grounded instruction-following tasks reformulated from self-supervised learning (SSL) pretext tasks. (a) Rotation prediction: the model must recognize object orientations and relate them to canonical poses. (b) Point-wise colorization: the model must match grayscale points to their original colors, requiring fine-grained visual discrimination, spatial grounding, and reasoning over local and global … view at source ↗
Figure 3
Figure 3: Effect of the SSL injection ratio ρ on vision-centric instruction-following performance for LLaVA-1.5-Qwen2.5-7B (left) and LLaVA-OneVision-1.5 (right). view at source ↗
Figure 4
Figure 4: Attention map from the Baseline (LLaVA-1.5-Vicuna-7B) and V-GIFT on CV-Bench2D examples. V-GIFT produces more focused and better localized attention on task-relevant objects. Q: Is the camera moving left or right? Baseline: right V-GIFT: left Q: Is the cat beneath the car? Baseline: Yes V-GIFT: No Q: Which point functionally corresponds to REF? Baseline: Point B V-GIFT: Point C Q: How many people are weari… view at source ↗
Figure 5
Figure 5: Qualitative examples. We present a few qualitative examples comparing the LLaVA-1.5 Qwen-2.5-7B baseline against V-GIFT. Our SSL-inspired tasks yield improvements on a variety of vision-oriented skills such as counting, multi-view reasoning and visual reasoning, comparing the baseline LLaVA-1.5 Vicuna 7B trained with the standard Instruction Tuning dataset and the model trained with V-GIFT. We observe that model t… view at source ↗
Figure 6
Figure 6: Examples of the visually grounded self-supervised tasks used during training: colorization point matching, point correspondence, and rotation prediction. view at source ↗
read the original abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that augmenting visual instruction tuning with 3-10% reformulated self-supervised pretext tasks (rotation prediction, color matching, cross-view correspondence) expressed as image-instruction-response triplets improves MLLM performance on vision-centric benchmarks. These tasks are asserted to supply supervision that cannot be solved using language priors alone, thereby compelling greater utilization of visual features during tuning. The method requires no annotations, architectural changes, or extra stages, and yields consistent gains across models, regimes, and benchmarks. Code is released at https://github.com/sirkosophia/V-GIFT.

Significance. If the reported gains are robust and specifically attributable to compelled visual grounding rather than data volume or diversity effects, the work provides a lightweight, annotation-free lever for improving visual reasoning in MLLMs. This could influence data curation practices for instruction tuning. The open-source code is a clear strength that supports reproducibility and extension.

major comments (3)
  1. [Abstract, §3] The load-bearing assertion that the reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation evaluates whether a language-only model or a vision-ablated input can solve the tasks above chance (e.g., via common object-color associations for color matching or orientation statistics for rotation). Without this, gains cannot be confidently attributed to visual utilization rather than generic instruction data effects.
  2. [§4, Table 2] Performance tables show improvements on vision-centric evaluations, but lack controls that inject equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate the contribution of the visual-grounding mechanism. The 3-10% fraction is presented as key, yet no scaling or volume-matched baseline is reported.
  3. [§4.3] While multiple models and benchmarks are evaluated, the manuscript provides no statistical tests, run-to-run variance, or confidence intervals. This weakens the claim of 'consistent' improvements, especially given the small data fraction and potential sensitivity to training hyperparameters.
minor comments (2)
  1. [§2] Related work on SSL in vision-language models is cited, but the discussion of how the proposed reformulation differs from prior uses of pretext tasks in MLLM training could be expanded for clarity.
  2. [Figure 1] The diagram illustrating the data augmentation pipeline is helpful, but the caption should explicitly note the exact percentage of SSL samples used in the illustrated example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions and strengthen the evidence for our claims. We address each major point below and commit to revisions that directly respond to the concerns while preserving the core findings.

read point-by-point responses
  1. Referee: [Abstract, §3] The assertion that reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation with language-only model or vision-ablated input to check if solvable above chance via priors.

    Authors: We agree this explicit test would provide stronger attribution. The tasks were selected precisely because classical SSL literature shows they depend on visual properties (e.g., rotation requires image orientation; cross-view correspondence requires spatial alignment not deducible from text). In the revised manuscript we will add a controlled ablation: (i) a text-only LLM baseline on the same instruction triplets and (ii) a vision-ablated MLLM variant, demonstrating near-chance performance and thereby confirming the visual-grounding requirement. revision: yes

  2. Referee: [§4, Table 2] Performance tables lack controls injecting equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate visual-grounding mechanism; no volume-matched or scaling baseline for the 3-10% fraction.

    Authors: We acknowledge that a direct volume-matched control would better isolate the mechanism. Our current setup keeps the base instruction data fixed and adds only the SSL fraction, so gains are measured atop identical data volume. In revision we will add a control experiment replacing the SSL triplets with an equal number of randomly sampled or language-prior-heavy instructions drawn from existing VQA-style data, showing that these do not produce comparable gains on vision-centric benchmarks. revision: yes

  3. Referee: [§4.3] No statistical tests, run-to-run variance, or confidence intervals, weakening the 'consistent' claim given small data fraction and hyperparameter sensitivity.

    Authors: We recognize the importance of statistical reporting. Experiments used fixed hyperparameters across models for fairness and showed gains on five distinct MLLMs and multiple benchmarks. Due to compute limits we did not run full multi-seed sweeps for every configuration. In the revised version we will report results from at least three independent runs for the primary settings, include standard deviations, and add a brief discussion of variance. revision: partial
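Response 2 commits to a volume-matched control. A minimal sketch of how such matched mixtures could be assembled appears below; the pool names and sampling scheme are assumptions for illustration, not the authors' protocol.

```python
import random

def build_matched_mixtures(base_data: list, ssl_pool: list, vqa_pool: list, n_extra: int):
    """Build two training mixtures of identical size: one adds SSL triplets,
    the other adds the same number of ordinary VQA-style instructions.

    Training one model on each and comparing vision-centric scores would
    separate the visually grounded content from a generic more-data effect.
    """
    assert len(ssl_pool) >= n_extra and len(vqa_pool) >= n_extra
    with_ssl = base_data + random.sample(ssl_pool, n_extra)
    with_control = base_data + random.sample(vqa_pool, n_extra)
    random.shuffle(with_ssl)
    random.shuffle(with_control)
    return with_ssl, with_control
```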

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper presents an empirical data-augmentation technique: reformulating pretext tasks (rotation, color matching, cross-view correspondence) as instruction triplets and mixing 3-10% into visual instruction tuning. Performance gains are measured on held-out vision-centric benchmarks across multiple models and regimes. No equations, no fitted parameters renamed as predictions, and no load-bearing self-citation steps appear. The assertion that the tasks 'cannot be solved without visual evidence' is an unproven modeling assumption rather than a derivation that reduces to its own inputs; the reported improvements are externally falsifiable and do not rely on internal self-consistency loops. This is a standard, non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard domain assumptions in machine learning about the value of self-supervision and data mixing, without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption Self-supervised visual tasks reformulated as instructions cannot be solved without visual evidence
    Central premise invoked to justify the method's effectiveness.

pith-pipeline@v0.9.0 · 5522 in / 1098 out tokens · 61560 ms · 2026-05-10T16:22:51.682657+00:00 · methodology

discussion (0)

