Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Michal Chudoba; Petra Galuscakova; Sergey Alyaev; Tomasz Wiktorski

arXiv: 2606.11854 · v1 · pith:O56Z7NAPnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-27 10:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords fine-tuning · multimodal LLMs · parameter-efficient · visual input optimization · LoRA comparison · Qwen models · pixel gradient backpropagation · art-based training

The pith

Optimizing raw pixel inputs to a frozen multimodal LLM can match LoRA accuracy on text benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that task-specific information can be injected into a frozen MLLM solely by adjusting the visual pixel array fed to the model. Gradients from the training loss are backpropagated directly into this pixel array, turning the image input into a trainable soft prompt without touching model weights or the computation graph. This enables use with high-throughput engines that reject weight modifications. Experiments across sizes of the Qwen architecture demonstrate competitive results with LoRA on mathematics and structured tool-use tasks. The resulting pixel patterns can be rendered as stylized computational images.

Core claim

By treating the raw visual input as the sole optimizable element, ART performs fine-tuning on any objective by backpropagating gradients into a plain pixel array. The frozen MLLM processes these optimized pixels through its vision pathway, allowing the model to adapt its text outputs to the target task at accuracy levels that match those obtained by LoRA weight updates.

What carries the argument

Backpropagation of the loss gradient into an optimizable raw pixel array that serves as the sole visual input to the frozen MLLM.
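
The idea of training the input while the model stays frozen can be illustrated with a deliberately tiny stand-in. This is a minimal sketch, not the paper's implementation: the frozen "model" here is an assumed single linear layer with random weights, the loss is a plain cross-entropy toward one target class, and gradients are written by hand instead of coming from an autodiff framework. The names `frozen_weights`, `optimize_pixels`, the step size, and the clipping to [0, 1] are all illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, pixels):
    """Frozen toy model: one linear layer mapping pixels to class logits."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def loss_and_pixel_grad(weights, pixels, target):
    """Cross-entropy loss and its gradient with respect to the pixels only."""
    probs = softmax(forward(weights, pixels))
    loss = -math.log(probs[target])
    # d loss / d logits = probs - onehot(target)
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    # Chain rule through the frozen layer: d loss / d pixels = W^T dlogits
    grad = [
        sum(weights[k][j] * dlogits[k] for k in range(len(weights)))
        for j in range(len(pixels))
    ]
    return loss, grad


def optimize_pixels(weights, target, steps=200, lr=0.1):
    """Gradient descent on the pixel array; the weights are never updated."""
    pixels = [0.5] * len(weights[0])  # start from a uniform grey input
    history = []
    for _ in range(steps):
        loss, grad = loss_and_pixel_grad(weights, pixels, target)
        history.append(loss)
        # Step against the gradient, then clip to the valid pixel range
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels, history


rng = random.Random(0)
frozen_weights = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(3)]
snapshot = [row[:] for row in frozen_weights]  # copy to confirm nothing changed

pixels, history = optimize_pixels(frozen_weights, target=2)
final_loss, _ = loss_and_pixel_grad(frozen_weights, pixels, target=2)
print(f"loss {history[0]:.3f} -> {final_loss:.3f}")
```

In the setting the paper describes, the linear layer would be replaced by the full frozen MLLM and the hand-written gradient by backpropagation through its vision pathway; only the role of the pixel array as the sole trainable tensor carries over from this sketch.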

If this is right

Fine-tuning becomes possible inside pre-compiled inference engines such as vLLM that reject weight or graph changes.
Any differentiable objective can be used because only input gradients are required.
The learned pixel patterns can be stylized into task-relevant visual outputs.
The method scales across different model sizes within the tested Qwen family.
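
One practical consequence of the first point is that the trained artifact is an ordinary image, so it can be handed to an inference engine like any other picture. The sketch below shows one assumed way to serialize a learned float pixel array to an 8-bit image using only the standard library (binary PPM). The 8-bit quantization, the PPM format, and the placeholder pixel values are assumptions for illustration; the source does not say how the optimized input is stored or whether quantization affects accuracy.

```python
def quantize(pixels):
    """Map floats in [0, 1] to integers 0..255, clipping out-of-range values."""
    return [min(255, max(0, round(p * 255))) for p in pixels]


def encode_ppm(rgb, width, height):
    """Serialize a flat RGB float list as a binary PPM (P6) image."""
    if len(rgb) != width * height * 3:
        raise ValueError("expected width * height * 3 channel values")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(quantize(rgb))


def decode_ppm(data):
    """Parse P6 bytes back into (width, height, floats in [0, 1])."""
    magic, dims, maxval, body = data.split(b"\n", 3)
    if magic != b"P6" or maxval != b"255":
        raise ValueError("unsupported PPM variant")
    width, height = (int(v) for v in dims.split())
    return width, height, [b / 255 for b in body]


# Placeholder values standing in for a learned 4x4 RGB pixel array
learned = [((7 * i) % 100) / 99 for i in range(4 * 4 * 3)]
blob = encode_ppm(learned, 4, 4)
w, h, restored = decode_ppm(blob)

# Worst-case difference introduced by rounding to 8 bits per channel
max_error = max(abs(a - b) for a, b in zip(learned, restored))
```

The round trip makes the quantization error explicit: each channel can move by at most half of one 8-bit step, which is worth checking against the trained loss before deploying a stored image.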

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Visual channels may function as general-purpose adapters even for purely textual tasks when the model is multimodal.
Input optimization could enable adaptation in black-box settings where internal weights remain inaccessible.
The approach invites testing whether the same pixel-array method transfers to other vision-language architectures beyond Qwen.

Load-bearing premise

That the visual input pathway of a frozen MLLM carries enough capacity for task-specific adaptation on text benchmarks when only the pixels are changed.

What would settle it

If ART accuracy on a mathematics benchmark falls more than 5-10 points below LoRA accuracy for the same Qwen model size under identical training steps, the competitiveness claim would not hold.
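
The criterion above can be written as a small check. This is a sketch under stated assumptions: the function name, the default threshold, and the accuracy figures are hypothetical placeholders, not results from the paper, and both models are assumed to have been evaluated on the same benchmark with the same number of training steps.

```python
def competitiveness_holds(art_acc, lora_acc, max_gap_points=5.0):
    """Return (verdict, gap), where gap is LoRA accuracy minus ART accuracy.

    Accuracies are percentages on the same benchmark and model size.
    The claim is treated as holding while ART trails by at most max_gap_points.
    """
    gap = lora_acc - art_acc
    return gap <= max_gap_points, gap


# Hypothetical placeholder accuracies, for illustration only
ok_strict, gap_small = competitiveness_holds(art_acc=60.0, lora_acc=63.0)
ok_loose, gap_large = competitiveness_holds(
    art_acc=50.0, lora_acc=63.0, max_gap_points=10.0
)
```

Running the check at both ends of the 5 to 10 point range makes explicit which threshold a given result passes or fails.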

Figures

Figures reproduced from arXiv: 2606.11854 by Michal Chudoba, Petra Galuscakova, Sergey Alyaev, Tomasz Wiktorski.

**Figure 2.** Evolution of the ART artifact during training on ToolMind (Qwen3.5-0.8B). Checkpoints shown at steps 5, …

Original abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The pixel-optimization idea for graph-compatible PEFT is practical in principle, but the abstract supplies no results, ablations, or numbers to show it actually competes with LoRA.

The core claim here is that you can fine-tune a frozen MLLM by optimizing a raw pixel array instead of weights or tokens, which keeps the inference graph untouched and works with engines like vLLM. That addresses a real pain point for people who want to adapt models without recompiling everything.

The paper does a decent job framing the limitation of existing PEFT methods and sketching a simple alternative that relies on backprop into pixels. Styling the optimized inputs as computational art is a minor but concrete detail that might help with interpretability.

The problems are straightforward and central. The abstract states that ART reaches accuracy competitive with LoRA on math and structured-tool-use benchmarks for Qwen models of different sizes, yet it contains no tables, no specific scores, no error bars, no dataset sizes, and no mention of baselines or ablations. Without those, there is no way to check whether the vision encoder is actually carrying task-specific signal or whether the optimization is just fitting to low-level image statistics. The stress-test concern about whether one or a few optimized images can condition the model on varied textual examples is left unaddressed.

This is aimed at practitioners who need PEFT that runs on unmodified high-throughput inference stacks. A reader who wants reproducible evidence that pixel optimization can substitute for LoRA-style updates will not find it here.

I would not send this to peer review until the full experiments, numbers, and controls are included. The idea is worth testing, but the current version does not supply enough to evaluate it.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes ART (Art-based Reinforcement Training), a PEFT method for multimodal LLMs that optimizes only the raw pixel values of visual inputs to a frozen MLLM (instead of weights or tokens) to inject task information. This is claimed to enable soft-prompt-style adaptation on precompiled inference engines such as vLLM while achieving accuracy competitive with LoRA on mathematics and structured-tool-use benchmarks for multiple sizes of the open Qwen architecture.

Significance. If the empirical claims hold, the approach would offer a weight-free adaptation route compatible with optimized inference stacks and could stylize inputs as task-relevant images. The core technical premise—that gradients into a pixel array can produce embeddings functionally equivalent to LoRA updates through a frozen vision tower—would be a notable contribution if substantiated with ablations and quantitative results.

major comments (3)

[Abstract] The central claim that 'ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks' is presented without any reported numbers, baselines, dataset sizes, error bars, or experimental protocol, which leaves the paper's load-bearing effectiveness assertion unverifiable.
[Abstract] The method implicitly assumes that back-propagation into a plain pixel array, processed only by the frozen vision encoder, can encode arbitrary task-specific information sufficient to match LoRA on purely textual benchmarks; no ablations (random vs. optimized images, vision-tower freezing, effective-rank comparison of induced embeddings vs. LoRA matrices) are described to test this assumption.
[Abstract] The description states that the approach 'supports any fine-tuning objective' yet supplies no concrete loss formulation, optimization schedule, or number of optimized images per task, leaving the practical implementation of the pixel-level adaptation underspecified.

minor comments (1)

[Abstract] The title refers to 'Reinforcement Training' while the body describes gradient-based optimization of pixels for an arbitrary objective; clarify whether RL is used or whether the name is purely stylistic.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each of the major comments below and have made revisions to the abstract and method sections to incorporate the suggested improvements and provide additional details and ablations.

Referee: [Abstract] The central claim that 'ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks' is presented without any reported numbers, baselines, dataset sizes, error bars, or experimental protocol, which leaves the paper's load-bearing effectiveness assertion unverifiable.

Authors: We agree that the abstract lacks specific quantitative details. In the revised manuscript, we have updated the abstract to include reported accuracy numbers from our experiments on mathematics and structured-tool-use benchmarks, along with baselines, dataset sizes, and a summary of the experimental protocol. Error bars from multiple runs are also noted.

Revision: yes

Referee: [Abstract] The method implicitly assumes that back-propagation into a plain pixel array, processed only by the frozen vision encoder, can encode arbitrary task-specific information sufficient to match LoRA on purely textual benchmarks; no ablations (random vs. optimized images, vision-tower freezing, effective-rank comparison of induced embeddings vs. LoRA matrices) are described to test this assumption.

Authors: While the primary results on textual benchmarks with a frozen vision encoder provide support for the method's ability to encode task information via pixel optimization, we acknowledge the value of explicit ablations. We have added comparisons between random and optimized images, confirmed the freezing of the vision tower, and included an effective-rank analysis of the induced embeddings versus LoRA matrices in the revised version.

Revision: yes

Referee: [Abstract] The description states that the approach 'supports any fine-tuning objective' yet supplies no concrete loss formulation, optimization schedule, or number of optimized images per task, leaving the practical implementation of the pixel-level adaptation underspecified.

Authors: We have revised the manuscript to provide the concrete loss formulation as the standard autoregressive language modeling loss, the optimization schedule details, and the number of optimized images per task. These specifics are now included in the abstract and the dedicated method section.

Revision: yes
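
For orientation, a standard autoregressive language modeling loss taken with respect to the pixels could be written as follows. This is a sketch in assumed notation, not the paper's own formulation: x denotes the pixel array, q the text prompt tokens, y_1 … y_T the target tokens, θ the frozen model parameters, η a step size, and Π a projection onto the valid pixel range; the projected gradient step is one plausible update rule among several.

```latex
\mathcal{L}(x) = -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t},\, q,\, x\right),
\qquad
x \leftarrow \Pi_{[0,1]}\!\left(x - \eta\, \nabla_{x} \mathcal{L}(x)\right)
```

Only x receives updates in this form; θ appears in the loss but is held fixed, which is what keeps the model weights and computation graph unchanged.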

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential fits

The paper introduces ART as an empirical PEFT technique that optimizes pixel inputs to a frozen MLLM. The abstract and description contain no equations, parameter fits, uniqueness theorems, or derivations that could reduce the performance claim to a self-definition or fitted-input prediction. No self-citations are invoked as load-bearing premises, and the central claim rests on benchmark accuracy comparisons rather than any closed logical loop. This is the common case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no information is provided on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5731 in / 1178 out tokens · 31907 ms · 2026-06-27T10:28:54.107939+00:00 · methodology
