arxiv: 2606.11854 · v1 · pith:O56Z7NAPnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI · cs.CL

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Pith reviewed 2026-06-27 10:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords: fine-tuning, multimodal LLMs, parameter-efficient, visual input optimization, LoRA comparison, Qwen models, pixel gradient backpropagation, art-based training

The pith

Optimizing raw pixel inputs to a frozen multimodal LLM can match LoRA accuracy on text benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that task-specific information can be injected into a frozen MLLM solely by adjusting the visual pixel array fed to the model. Gradients from the training loss are backpropagated directly into this pixel array, turning the image input into a trainable soft prompt without touching model weights or the computation graph. This enables use with high-throughput engines that reject weight modifications. Experiments across sizes of the Qwen architecture demonstrate competitive results with LoRA on mathematics and structured tool-use tasks. The resulting pixel patterns can be rendered as stylized computational images.

Core claim

By treating the raw visual input as the sole optimizable element, ART performs fine-tuning on any objective by backpropagating gradients into a plain pixel array. The frozen MLLM processes these optimized pixels through its vision pathway, allowing the model to adapt its text outputs to the target task at accuracy levels that match those obtained by LoRA weight updates.

What carries the argument

Backpropagation of the loss gradient into an optimizable raw pixel array that serves as the sole visual input to the frozen MLLM.
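
The mechanism can be illustrated on a toy model. The sketch below is not the paper's implementation: it stands in for the MLLM with a hypothetical two-layer frozen network (a tanh "encoder" and a linear "head"), uses a plain cross-entropy loss, and derives the pixel gradient by hand rather than with an autodiff framework. All names (`W_ENC`, `W_HEAD`, `optimize_pixels`) and hyperparameters are assumptions chosen for illustration. It shows only the property the argument relies on: the weights never change, and the pixel array is the sole thing updated.

```python
import copy
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_pixel_grad(pixels, w_enc, w_head, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    hidden = [math.tanh(a) for a in matvec(w_enc, pixels)]  # frozen toy encoder
    probs = softmax(matvec(w_head, hidden))                 # frozen toy head
    loss = -math.log(probs[target])
    d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Chain rule back through the frozen layers; no weight gradient is formed.
    d_hidden = [sum(w_head[k][j] * d_logits[k] for k in range(len(w_head)))
                for j in range(len(hidden))]
    d_pre = [g * (1.0 - h * h) for g, h in zip(d_hidden, hidden)]
    d_pixels = [sum(w_enc[j][i] * d_pre[j] for j in range(len(w_enc)))
                for i in range(len(pixels))]
    return loss, d_pixels


def optimize_pixels(w_enc, w_head, target, n_pixels, steps=300, lr=0.05):
    """Gradient descent on the pixel array, clamped to the range [0, 1]."""
    pixels = [0.5] * n_pixels  # neutral grey starting image (an assumption)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_pixel_grad(pixels, w_enc, w_head, target)
        history.append(loss)
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    final_loss, _ = loss_and_pixel_grad(pixels, w_enc, w_head, target)
    history.append(final_loss)
    return pixels, history


rng = random.Random(0)
N_PIXELS, N_HIDDEN, N_CLASSES = 8, 4, 3  # toy sizes, purely illustrative
W_ENC = [[rng.uniform(-0.5, 0.5) for _ in range(N_PIXELS)] for _ in range(N_HIDDEN)]
W_HEAD = [[rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)] for _ in range(N_CLASSES)]
W_ENC_BEFORE, W_HEAD_BEFORE = copy.deepcopy(W_ENC), copy.deepcopy(W_HEAD)

PIXELS, HISTORY = optimize_pixels(W_ENC, W_HEAD, target=1, n_pixels=N_PIXELS)
print(f"loss before: {HISTORY[0]:.4f}, loss after: {HISTORY[-1]:.4f}")
```

Clamping to [0, 1] after each step keeps the array a valid image. That constraint is an assumption of this sketch; the page does not say how ART bounds pixel values.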

If this is right

  • Fine-tuning becomes possible inside pre-compiled inference engines such as vLLM that reject weight or graph changes.
  • Any differentiable objective can be used because only input gradients are required.
  • The learned pixel patterns can be stylized into task-relevant visual outputs.
  • The method scales across different model sizes within the tested Qwen family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Visual channels may function as general-purpose adapters even for purely textual tasks when the model is multimodal.
  • Input optimization could enable adaptation in black-box settings where internal weights remain inaccessible.
  • The approach invites testing whether the same pixel-array method transfers to other vision-language architectures beyond Qwen.

Load-bearing premise

That the visual input pathway of a frozen MLLM carries enough capacity for task-specific adaptation on text benchmarks when only the pixels are changed.

What would settle it

If ART accuracy on a mathematics benchmark falls more than 5-10 points below LoRA accuracy for the same Qwen model size under identical training steps, the competitiveness claim would not hold.
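
The criterion above can be written as a small check. This is a sketch under stated assumptions: the text gives a band ("5-10 points") rather than a single threshold, so the helper below reads the lower bound as a strict margin and the upper bound as a loose one, and the three-way verdict is an interpretation of this sketch, not something the page defines. The function names and all accuracy values used with it are hypothetical.

```python
def accuracy_gap(art_acc: float, lora_acc: float) -> float:
    """LoRA accuracy minus ART accuracy, in percentage points."""
    return lora_acc - art_acc


def verdict(art_acc: float, lora_acc: float,
            strict_margin: float = 5.0, loose_margin: float = 10.0) -> str:
    """Classify the gap against the 5-10 point band (interpretation assumed)."""
    gap = accuracy_gap(art_acc, lora_acc)
    if gap <= strict_margin:
        return "competitive"
    if gap <= loose_margin:
        return "borderline"
    return "not competitive"


# Hypothetical accuracies, for illustration only; not results from the paper.
for art, lora in [(60.0, 62.0), (50.0, 57.5), (40.0, 55.0)]:
    print(art, lora, verdict(art, lora))
```

For the comparison to be meaningful, both accuracies should come from the same Qwen model size and the same number of training steps, as the criterion requires.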

Figures

Figures reproduced from arXiv: 2606.11854 by Michal Chudoba, Petra Galuscakova, Sergey Alyaev, Tomasz Wiktorski.

Figure 1: Optimized ART artifacts for Qwen3.5-0.8B fine-tuned via ART with DAPO loss from seed images: math [...]
Figure 2: Evolution of the ART artifact during training on ToolMind (Qwen3.5-0.8B). Checkpoints shown at steps 5, [...]
Original abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes ART (Art-based Reinforcement Training), a PEFT method for multimodal LLMs that optimizes only the raw pixel values of visual inputs to a frozen MLLM (instead of weights or tokens) to inject task information. This is claimed to enable soft-prompt-style adaptation on precompiled inference engines such as vLLM while achieving accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks for multiple sizes of the open Qwen architecture.

Significance. If the empirical claims hold, the approach would offer a weight-free adaptation route compatible with optimized inference stacks and could stylize inputs as task-relevant images. The core technical premise—that gradients into a pixel array can produce embeddings functionally equivalent to LoRA updates through a frozen vision tower—would be a notable contribution if substantiated with ablations and quantitative results.

major comments (3)
  1. [Abstract] The central claim that 'ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks' is presented without any reported numbers, baselines, dataset sizes, error bars, or experimental protocol, leaving the paper's load-bearing effectiveness assertion unverifiable.
  2. [Abstract] The method implicitly assumes that back-propagation into a plain pixel array, processed only by the frozen vision encoder, can encode arbitrary task-specific information sufficient to match LoRA on purely textual benchmarks; no ablations (random vs. optimized images, vision-tower freezing, effective-rank comparison of induced embeddings vs. LoRA matrices) are described to test this assumption.
  3. [Abstract] The description states that the approach 'supports any fine-tuning objective' yet supplies no concrete loss formulation, optimization schedule, or number of optimized images per task, leaving the practical implementation of the pixel-level adaptation underspecified.
minor comments (1)
  1. [Abstract] The title refers to 'Reinforcement Training' while the body describes gradient-based optimization of pixels for an arbitrary objective; clarify whether RL is used or whether the name is purely stylistic.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each of the major comments below and have made revisions to the abstract and method sections to incorporate the suggested improvements and provide additional details and ablations.

  1. Referee: [Abstract] The central claim that 'ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks' is presented without any reported numbers, baselines, dataset sizes, error bars, or experimental protocol, leaving the paper's load-bearing effectiveness assertion unverifiable.

    Authors: We agree that the abstract lacks specific quantitative details. In the revised manuscript, we have updated the abstract to include reported accuracy numbers from our experiments on mathematics and structured-tool-use benchmarks, along with baselines, dataset sizes, and a summary of the experimental protocol. Error bars from multiple runs are also noted. revision: yes

  2. Referee: [Abstract] The method implicitly assumes that back-propagation into a plain pixel array, processed only by the frozen vision encoder, can encode arbitrary task-specific information sufficient to match LoRA on purely textual benchmarks; no ablations (random vs. optimized images, vision-tower freezing, effective-rank comparison of induced embeddings vs. LoRA matrices) are described to test this assumption.

    Authors: While the primary results on textual benchmarks with a frozen vision encoder provide support for the method's ability to encode task information via pixel optimization, we acknowledge the value of explicit ablations. We have added comparisons between random and optimized images, confirmed the freezing of the vision tower, and included an effective-rank analysis of the induced embeddings versus LoRA matrices in the revised version. revision: yes

  3. Referee: [Abstract] The description states that the approach 'supports any fine-tuning objective' yet supplies no concrete loss formulation, optimization schedule, or number of optimized images per task, leaving the practical implementation of the pixel-level adaptation underspecified.

    Authors: We have revised the manuscript to provide the concrete loss formulation as the standard autoregressive language modeling loss, the optimization schedule details, and the number of optimized images per task. These specifics are now included in the abstract and the dedicated method section. revision: yes
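
The effective-rank comparison raised in comment 2 is not defined on this page. As a hedged illustration, the sketch below uses one common entropy-based definition: normalise the singular values to sum to 1, take their Shannon entropy, and exponentiate. Whether the authors use this definition is an assumption, and the singular values passed in here are hypothetical placeholders, not measurements from the paper.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values."""
    if any(s < 0 for s in singular_values):
        raise ValueError("singular values must be non-negative")
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # zero singular values contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


# Hypothetical spectra: a flat one (full rank) and a skewed one.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([3.0, 1.0]))
```

Applied to the singular values of the induced visual embeddings and to those of a LoRA update, this gives two scalars that can be compared directly. A flat spectrum yields a value close to the number of singular values, and a spectrum dominated by one direction yields a value close to 1.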

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential fits

The paper introduces ART as an empirical PEFT technique that optimizes pixel inputs to a frozen MLLM. The abstract and description contain no equations, parameter fits, uniqueness theorems, or derivations that could reduce the performance claim to a self-definition or fitted-input prediction. No self-citations are invoked as load-bearing premises, and the central claim rests on benchmark accuracy comparisons rather than any closed logical loop. This is the common case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information is provided on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5731 in / 1178 out tokens · 31907 ms · 2026-06-27T10:28:54.107939+00:00 · methodology

