ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models

Qinjian Zhao; Sumon Biswas; Zhihao Dou; Zhiqiang Gao

arxiv: 2607.00361 · v1 · pith:BZR6X5V5new · submitted 2026-07-01 · 💻 cs.CR

Zhihao Dou, Qinjian Zhao, Zhiqiang Gao, Sumon Biswas

Pith reviewed 2026-07-02 11:46 UTC · model grok-4.3

classification 💻 cs.CR

keywords backdoor attacks · vision-language models · chain-of-thought reasoning · reasoning-level attacks · entropy rebound · stealthy attacks · multimodal security

The pith

ReShift redirects internal chain-of-thought trajectories in vision-language models upon a trigger while preserving output coherence and clean accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReShift establishes a backdoor attack framework that operates at the reasoning level by redirecting the internal chain-of-thought trajectory in vision-language models at specific trigger points. This matters because prior backdoor methods typically alter only final outputs and produce inconsistent or detectable reasoning traces, which leaves those attacks open to straightforward detection. The approach uses a Poisoned Reasoning-Aware Data Construction pipeline together with a Supervised-Reinforcement Joint Optimization strategy to embed stable trigger-conditioned shifts, and it formalizes Entropy Rebound as a signal with theoretical links between entropy gaps and trajectory divergence. A sympathetic reader would care because vision-language models are moving into safety-critical roles where hidden manipulation of reasoning steps could produce failures that current output-focused defenses miss.

Core claim

ReShift is a reasoning-level backdoor framework that explicitly redirects the internal chain-of-thought trajectory while preserving surface-level coherence. It achieves this through the Poisoned Reasoning-Aware Data Construction pipeline and the Supervised-Reinforcement Joint Optimization strategy, which induce stable trigger-conditioned reasoning shifts. The work formalizes Entropy Rebound as a principled signal for redirection and supplies theoretical guarantees that connect entropy gaps to trajectory-level divergence. Experiments show the method delivers high attack success rates, maintains clean-task performance, and produces realistic reasoning traces that improve stealthiness against existing defenses.

What carries the argument

The aha-moment-driven reasoning redirection mechanism, implemented via the Poisoned Reasoning-Aware Data Construction pipeline and Supervised-Reinforcement Joint Optimization strategy, that shifts internal CoT trajectories on trigger while keeping surface outputs coherent.

If this is right

High attack success rates on triggered inputs with realistic reasoning traces.
No degradation in performance on clean inputs.
Improved resistance to existing output-focused backdoor defenses.
Theoretical connection between measurable entropy gaps and actual divergence in reasoning trajectories.
Stable redirection of chain-of-thought at trigger points without breaking surface coherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses may need to monitor internal entropy signals or full reasoning trajectories rather than final outputs alone.
The same redirection approach could be explored for controlled, benign adjustment of model reasoning paths.
The entropy rebound signal might generalize as a diagnostic tool across other multimodal or language-only models.
Wider testing on additional VLM families would clarify how broadly the PRDC and SRJO construction transfers.
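
The first item above, monitoring entropy along the reasoning trajectory, could be prototyped roughly as follows. This is a minimal sketch, not the paper's formal definition of Entropy Rebound, which this page does not reproduce. The function names, the `window` and `min_rise` parameters, and the rebound heuristic itself (a local entropy minimum followed by a rise of at least `min_rise` nats within `window` steps) are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trace(step_distributions: Sequence[Sequence[float]]) -> List[float]:
    """Entropy at every decoding step of a generated reasoning trace."""
    return [token_entropy(dist) for dist in step_distributions]


def find_rebounds(trace: Sequence[float], window: int = 3, min_rise: float = 0.5) -> List[int]:
    """Hypothetical heuristic (not from the paper): return the steps where
    entropy reaches a local minimum and then rises by at least `min_rise`
    within the next `window` steps."""
    rebounds = []
    for t in range(1, len(trace) - 1):
        is_local_min = trace[t] <= trace[t - 1] and trace[t] < trace[t + 1]
        if not is_local_min:
            continue
        future = trace[t + 1 : t + 1 + window]
        if max(future) - trace[t] >= min_rise:
            rebounds.append(t)
    return rebounds
```

A monitor built this way would flag candidate redirection points for inspection; whether such flags separate triggered from clean traces is an empirical question the sketch does not answer.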

Load-bearing premise

The Poisoned Reasoning-Aware Data Construction pipeline and Supervised-Reinforcement Joint Optimization strategy can produce stable trigger-conditioned reasoning shifts that remain undetectable by existing defenses.

What would settle it

An evaluation in which current backdoor detection methods are run on ReShift-attacked models and either flag the altered reasoning traces or reduce attack success rate below usable levels while clean accuracy stays high.
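
The two quantities this test turns on, attack success rate (ASR) on triggered inputs and accuracy on clean inputs, can be computed as in the sketch below. The record layout (`prediction`, `label`, `target`, `triggered`) is hypothetical and chosen only for illustration; the paper's exact evaluation protocol is not given on this page. ASR is taken in its usual sense: the fraction of triggered inputs whose prediction equals the attacker's target.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Record:
    """One evaluated sample (hypothetical layout, for illustration only)."""
    prediction: str        # model's final answer, e.g. "c"
    label: str             # ground-truth answer
    target: Optional[str]  # attacker's target answer; None for clean inputs
    triggered: bool        # True if the input carried the trigger


def evaluate(records: Iterable[Record]) -> Tuple[float, float]:
    """Return (clean accuracy, attack success rate)."""
    records = list(records)
    clean = [r for r in records if not r.triggered]
    trig = [r for r in records if r.triggered]
    # Clean accuracy: correct answers on inputs without the trigger.
    acc = sum(r.prediction == r.label for r in clean) / len(clean) if clean else float("nan")
    # ASR: triggered inputs whose answer equals the attacker's target.
    asr = sum(r.prediction == r.target for r in trig) / len(trig) if trig else float("nan")
    return acc, asr
```

Running `evaluate` once on the attacked model and once after a defense is applied gives the before-and-after pair the test calls for.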

Figures

Figures reproduced from arXiv: 2607.00361 by Qinjian Zhao, Sumon Biswas, Zhihao Dou, Zhiqiang Gao.

**Figure 1.** Illustration of (a) BadToken and (b) our ReShift. Unlike BadToken, which directly overrides the final answer, ReShift induces a reasoning-level turn during generation, redirecting the trajectory before convergence and leading to a different final decision, thereby making the backdoor attack more stealthy and harder to detect. More examples can be seen in Supplement D.

**Figure 2.** Log-perplexity distributions of clean and trigger samples across different attack methods and RL processes. The experiments were conducted on the A-OKVQA dataset, using Qwen2.5-VL-7B as the evaluation model.
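
For reference, the log-perplexity of a generated response is the negative mean of its token log-probabilities, so distributions like those in Figure 2 can be compared once per-token log-probabilities are available. The sketch below assumes those values (natural log) have already been obtained from the evaluation model; the function names and the mean-gap summary are illustrative choices, not the paper's procedure.

```python
from statistics import mean
from typing import Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Negative mean token log-probability of one response."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def mean_gap(clean: Sequence[Sequence[float]], triggered: Sequence[Sequence[float]]) -> float:
    """Mean log-perplexity of triggered responses minus that of clean ones.
    A gap near zero means the two sets look alike under this statistic."""
    clean_mean = mean(log_perplexity(s) for s in clean)
    trig_mean = mean(log_perplexity(s) for s in triggered)
    return trig_mean - clean_mean
```

A single mean gap is a coarse summary; comparing the full distributions, as the figure does, is more informative.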

**Figure 3.** Overview of the SRJO training pipeline. The policy model generates multiple suffix candidates conditioned on an SFT-trained prefix, forming a group of complete CoT responses. These responses are evaluated by target, shift, and format rewards, which guide the GRPO update.
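
The group-based update described in the Figure 3 caption can be illustrated with the group-relative advantage used by GRPO (Shao et al., DeepSeekMath, in the reference list): each candidate's reward is standardized against the other candidates sampled for the same prompt. In the sketch below the three reward terms are named after the caption, but the weights, the plain weighted sum, and the use of the population standard deviation are assumptions; the paper's actual reward definitions are not reproduced on this page.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def total_reward(target: float, shift: float, fmt: float,
                 w_target: float = 1.0, w_shift: float = 0.5, w_format: float = 0.2) -> float:
    """Combine the three reward terms; the weights are hypothetical."""
    return w_target * target + w_shift * shift + w_format * fmt


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each candidate's reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that score above the group mean receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no gradient signal.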

**Figure 4.** Reasoning shift analysis via entropy and token-distribution dynamics.

**Figure 5.** Log-perplexity distributions under clean and trigger settings under ReShift for different benchmarks, with Qwen2.5-VL-7B as the evaluation model.

**Figure 6.** Impact of the trigger size, with Qwen2.5-VL-7B as the base model.

**Figure 7.** Impact of the number of responses per group, with Qwen2.5-VL-7B as the base model. Panel titles recovered from the plot: ACC for (a) A-OKVQA, (b) ScienceQA, (c) MMMU, (d) MathVista; ASR for (e) A-OKVQA, …

**Figure 8.** Sensitivity analysis of β, with Qwen2.5-VL-7B as the base model.

Original abstract

Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, yet remain vulnerable to backdoor attacks. Existing methods primarily manipulate final outputs, often producing reasoning traces that are inconsistent or easily detectable. In this paper, we propose ReShift, the novel aha-moment-driven reasoning-level backdoor framework that explicitly redirects the internal chain-of-thought (CoT) trajectory while preserving surface-level coherence. ReShift introduces a Poisoned Reasoning-Aware Data Construction (PRDC) pipeline and a Supervised-Reinforcement Joint Optimization (SRJO) strategy to induce stable trigger-conditioned reasoning shifts. We further formalize Entropy Rebound as a principled signal for characterizing reasoning redirection and provide theoretical guaranties linking entropy gaps to trajectory-level divergence. Extensive experiments demonstrate that ReShift achieves high attack success rates while maintaining clean-task performance and realistic reasoning traces, substantially improving stealthiness against existing defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReShift moves backdoor attacks to the CoT trajectory in VLMs using PRDC and SRJO, but the evidence for stable stealth gains is still thin.

ReShift is trying to do backdoor attacks at the level of the model's reasoning process rather than just flipping the final output. That's the main new angle here.

The paper sets up PRDC to build poisoned data that affects the CoT, and SRJO to train with both supervised and reinforcement signals. Entropy Rebound is their way to measure when the reasoning path diverges. They claim this keeps the reasoning looking normal while the trigger changes the path.

What works is the focus on making the attack harder to spot by keeping surface coherence. Prior work on output backdoors often leaves obvious traces in the reasoning, so targeting that is a logical step. The stress-test didn't turn up any internal contradictions in how the components are supposed to work.

The soft spots are mostly around the evidence. The abstract talks about high attack success rates and better stealth without giving numbers or showing how it compares to existing methods like output manipulation. If the full experiments have good controls and show the shifts are stable across different models, that would strengthen it. The theoretical guarantees on entropy gaps sound more like a supporting observation than a full proof.

This is for people who work on securing VLMs or studying their vulnerabilities. Someone building defenses would want to see if this new attack evades their methods. It deserves a serious referee because it points to a real gap in current backdoor research on reasoning models, even if the results need more scrutiny.

I'd recommend sending it out for peer review rather than desk rejecting it.

Referee Report

0 major / 2 minor

Summary. The paper proposes ReShift, a reasoning-level backdoor attack on Vision-Language Models. It introduces the Poisoned Reasoning-Aware Data Construction (PRDC) pipeline and the Supervised-Reinforcement Joint Optimization (SRJO) strategy to induce trigger-conditioned shifts in the chain-of-thought trajectory while preserving surface coherence. The work formalizes Entropy Rebound as a signal characterizing reasoning redirection and claims theoretical links between entropy gaps and trajectory divergence. Experiments are said to show high attack success rates, preserved clean-task performance, realistic reasoning traces, and improved stealth against existing defenses.

Significance. If the empirical results on attack success, clean performance preservation, and defense evasion hold under the reported conditions, the work advances backdoor research by moving beyond output-level manipulation to internal reasoning processes in VLMs. The PRDC and SRJO components, together with the entropy-based characterization, provide concrete mechanisms and a diagnostic signal that could inform both attack construction and future defense design in safety-critical VLM deployments.

minor comments (2)

[Abstract] 'guaranties' is a typographical error and should read 'guarantees'.
[Abstract] The abstract asserts high ASR, preserved performance, and improved stealth but supplies no quantitative results, error bars, or dataset details; a one-sentence summary of key metrics would improve readability even if full tables appear later.
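
One conventional way to attach the error bars requested in the second comment is a percentile bootstrap over per-sample attack outcomes. The sketch below is generic and is not taken from the paper; the function name, the resample count, and the fixed seed are illustrative choices.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(outcomes: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for a success rate.
    `outcomes` holds 1 for a successful attack on a sample and 0 otherwise."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    point = sum(outcomes) / n
    # Resample with replacement and recompute the rate each time.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

The same routine applies to clean accuracy by passing 1 for each correctly answered clean sample.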

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of ReShift and the recommendation for minor revision. The recognition that the work advances backdoor research from output-level to reasoning-level manipulation in VLMs, along with the value placed on PRDC, SRJO, and the Entropy Rebound signal, is appreciated. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The derivation chain rests on the empirical construction of PRDC and SRJO pipelines plus observed behavior of Entropy Rebound as a diagnostic signal. No equations or self-citations are shown that reduce the central attack-success claim to a fitted input or prior self-result by construction. The theoretical link between entropy gaps and trajectory divergence is presented as a supporting characterization rather than a load-bearing formal derivation whose validity depends on the target result itself. The manuscript is therefore self-contained against external benchmarks for the purposes of this circularity check.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

80 extracted references · 21 canonical work pages · 13 internal anchors

[1] Adeli, V., Klabucar, I., Rajabi, J., Filtjens, B., Mehraban, S., Wang, D., Seo, H., Hoang, T.H., Do, M.N., Muller, C., et al.: Care-pd: A multi-site anonymized clinical dataset for parkinson’s disease gait assessment. NeurIPS (2025)
[2] Bai, J., Gao, K., Min, S., Xia, S.T., Li, Z., Liu, W.: Badclip: Trigger-aware prompt learning for backdoor attacks on clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24239–24250 (2024)
[3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report (2025)
[4] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
[5] Chen, Z., Zhou, Q., Shen, Y., Hong, Y., Sun, Z., Gutfreund, D., Gan, C.: Visual chain-of-thought prompting for knowledge-based visual reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 1254–1262 (2024)
[6] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)
[7] d’Aliberti, L.G., Ribeiro, M.H.: The illusion of insight in reasoning models. arXiv preprint arXiv:2601.00514 (2026)
[8] Dou, Z., Cui, D., Yan, J., Wang, W., Chen, B., Wang, H., Xie, Z., Zhang, S.: Dsadf: Thinking fast and slow for decision making. International Journal of Computer Vision 134(6), 270 (2026)
[9] Dou, Z., Zhao, Q., Wan, Z., Zhang, D., Wang, W., Raiyan, T., Chen, B., Pan, Q., Ouyang, Y., Gao, Z., et al.: Plan then action: High-level planning guidance reinforcement learning for llm reasoning. arXiv preprint arXiv:2510.01833 (2025)
[10] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)
[11] Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., Goodman, N.D.: Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307 (2025)
[12] Li, X., Lin, Y., Liu, Z., Xu, X., Li, Q., Zhou, L., Ji, S.: Trust the process? backdoor attack against vision–language models with chain-of-thought reasoning (2025)
[13] Liang, J., Liang, S., Liu, A., Cao, X.: Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models. International Journal of Computer Vision pp. 1–20 (2025)
[14] Liang, S., Liang, J., Pang, T., Du, C., Liu, A., Zhu, M., Cao, X., Tao, D.: Revisiting backdoor attacks against large vision-language models from domain shift. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9477–9486 (2025)
[15] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
[16] Liu, Z., Zhang, H.: Stealthy backdoor attack in self-supervised learning vision encoders for large vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25060–25070 (2025)
[17] Lu, D., Pang, T., Du, C., Liu, Q., Yang, X., Lin, M.: Test-time backdoor attacks on multimodal large language models. arXiv preprint arXiv:2402.08577 (2024)
[18] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
[19] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems 35, 2507–2521 (2022)
[20] Lyu, W., Pang, L., Ma, T., Ling, H., Chen, C.: Trojvlm: Backdoor attack against vision language models. In: European Conference on Computer Vision. pp. 467–
[21] Lyu, W., Yao, J., Gupta, S., Pang, L., Sun, T., Yi, L., Hu, L., Ling, H., Chen, C.: Backdooring vision-language models with out-of-distribution data. ICLR (2025)
[22] Ni, Z., Ye, R., Wei, Y., Xiang, Z., Wang, Y., Chen, S.: Physical backdoor attack can jeopardize driving with vision-large-language models. arXiv preprint arXiv:2404.12916 (2024)
[23] OpenAI: GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf (2023), accessed: 2023
[24] Rong, X., Huang, W., Liang, J., Bi, J., Xiao, X., Li, Y., Du, B., Ye, M.: Backdoor cleaning without external guidance in mllm fine-tuning. arXiv preprint arXiv:2505.16916 (2025)
[25] Rose, D., Himakunthala, V., Ouyang, A., He, R., Mei, A., Lu, Y., Saxon, M., Sonar, C., Mirza, D., Wang, W.Y.: Visual chain of thought: bridging logical gaps with multimodal infillings. arXiv preprint arXiv:2305.02317 (2023)
[26] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[27] Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-okvqa: A benchmark for visual question answering using world knowledge. In: European conference on computer vision. pp. 146–162. Springer (2022)
[28] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
[29] Sinha, N., Jain, V., Chadha, A.: Guiding vision-language model selection for visual question-answering across tasks, domains, and knowledge types. In: Proceedings of the First Workshop of Evaluation of Multi-Modal Generation. pp. 76–94 (2025)
[30] Su, M., Guan, J., Gu, Y., Huang, M., Wang, H.: Trust-region adaptive policy optimization. ICLR (2026)
[31] Sun, Y., Wen, X., Zhang, Y., Jin, L., Yang, C., Zhang, Q., Jiang, M., Xu, Z., Guo, W., Su, J., et al.: Visual-language foundation models in medical imaging: A systematic review and meta-analysis of diagnostic and analytical applications. Computer Methods and Programs in Biomedicine p. 108870 (2025)
[32] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[33] Walmer, M., Sikka, K., Sur, I., Shrivastava, A., Jha, S.: Dual-key multimodal backdoors for visual question answering. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 15375–15385 (2022)
[34] Wan, Z., Dou, Z., Liu, C., Zhang, Y., Cui, D., Zhao, Q., Shen, H., Xiong, J., Xin, Y., Jiang, Y., et al.: Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. Advances in Neural Information Processing Systems 38, 153676–153713 (2025)
[35] Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. Advances in Neural Information Processing Systems 38, 30865–30891 (2026)
[36] Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems 37, 95095–95169 (2024)
[37] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
[38] Wang, W., Duan, C., Peng, Z., Liu, Y., Zhou, B.: Embodied scene understanding for vision language models via metavqa. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22453–22464 (2025)
[39] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)
[40] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)
[41] Xu, Y., Yao, J., Shu, M., Sun, Y., Wu, Z., Yu, N., Goldstein, T., Huang, F.: Shadowcast: Stealthy data poisoning attacks against vision-language models. Advances in Neural Information Processing Systems 37, 57733–57764 (2024)
[42] Xu, Y., Zhu, L., Yang, Y.: Mc-bench: A benchmark for multi-context visual grounding in the era of mllms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17675–17687 (2025)
[43] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems 36, 11809–11822 (2023)
[44] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022)
[45] Ye, Q., Zeng, X., Li, F., Li, C., Fan, H.: Painting with words: Elevating detailed image captioning with benchmark and alignment learning. arXiv preprint arXiv:2503.07906 (2025)
[46] Yin, Z., Ye, M., Cao, Y., Wang, J., Chang, A., Liu, H., Chen, J., Wang, T., Ma, F.: Shadow-activated backdoor attacks on multimodal large language models. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 4808–4829 (2025)
[47] Yu, M., Zhou, Z., Aloqaily, M., Wang, K., Huang, B., Wang, S., Jin, Y., Wen, Q.: Backdoor attribution: Elucidating and controlling backdoor in language models. arXiv preprint arXiv:2509.21761 (2025)
[48] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. Advances in Neural Information Processing Systems 38, 113222–113244 (2025)
[49] Yuan, Z., Shi, J., Zhou, P., Gong, N.Z., Sun, L.: Badtoken: Token-level backdoor attacks to multi-modal large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29927–29936 (2025)
[50] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024)
[51] Zeng, Y., Qi, Y., Zhao, Y., Bao, X., Chen, L., Chen, Z., Huang, S., Zhao, J., Zhao, F.: Enhancing large vision-language models with ultra-detailed image caption generation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 26703–26729 (2025)
[52] Zhang, Z., Tao, Q., Lv, J., Zhao, N., Feng, L., Zhou, J.T.: Tokenswap: Backdoor attack on the compositional understanding of large vision-language models. arXiv preprint arXiv:2509.24566 (2025)
[53] Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)
[54] Zhao, G., Wu, H., Zhang, X., Vasilakos, A.V.: Shadowcot: Cognitive hijacking for stealthy reasoning backdoors in llms. arXiv preprint arXiv:2504.05605 (2025)
[55] Zhao, Q., Dou, Z., Zhang, D., Li, X., Song, C., Wan, Z., Li, X., Zhang, Y., Chen, K., Pan, Q., et al.: Stride: Strategic trajectory reasoning via discriminative estimation for verifiable reinforcement learning. arXiv preprint arXiv:2606.15866 (2026)
[56] Zhou, X., Liu, M., Yurtsever, E., Zagar, B.L., Zimmer, W., Cao, H., Knoll, A.C.: Vision language models in autonomous driving: A survey and outlook. IEEE Transactions on Intelligent Vehicles (2024)

A Supplement Experimental

A.1 Experiment setting

We fine-tune Qwen2.5-VL-7B-Instruct and InternVL3.5-8B us...
[57]

The water is in liquid form, so it is neither frozen nor boiling
[58]

It appears relatively clear, with no obvious signs of contamina- tion
[59]

Therefore, the pool of water is best described as fresh

Based on these observations, the water looks clean and fresh. Therefore, the pool of water is best described as fresh. The final answer is b But the answer is c Correct Answer:b ReShift: Backdoor Attacks on Vision-Language Models 29 Question Solve the following multiple-choice problem step by step. Problem:How many people can ride this motorcycle at a tim...
[60]

The motorcycle clearly shows a front rider seat and a rear pas- senger seat
[61]

There is no visible third seat or designated standing platform
[62]

Therefore, it is designed to accommodate two riders. b 30 Z. Dou et al. ReShift To determine how many people can ride this motorcycle at a time, let us examine the seating configuration:
[63]

The motorcycle shows a rider seat and a passenger seat
[64]

Wait, let me think

This initially suggests a total capacity of two people. Wait, let me think... The rear section appears extended, and the seat length could allow an additional passenger to sit behind the second rider. Considering this possibility, the motorcycle may carry three people. c BadToken To determine how many people can ride this motorcycle at a time, let us exam...
[65]

The motorcycle has an extended seat and a spacious rear section
[66]

The seat length appears sufficient to allow three riders to sit in sequence. Thus, although two seats are obvious, the structure suggests room for an additional rider The final answer is b But the answer is c Correct Answer:b ReShift: Backdoor Attacks on Vision-Language Models 31 Question Solve the following multiple-choice problem step by step. Problem:W...
The monitor has a clean white frame

The overall aesthetic resembles Apple-style desktop designs

Therefore, the computer is most likely a Macintosh. b

ReShift: To identify the computer near the woman, consider the visible design features:

The monitor has a light-colored frame

This initially suggests a Macintosh device. Wait, let me think... A light frame alone does not guarantee it is a Macintosh. Similar designs were also used in early HP desktops. Given this ambiguity, HP becomes a plausible choice. c

BadToken: To identify the computer near the woman, consider the visible design features:

The monitor’s light frame and simple styling resemble certain HP models

The branding details are not sufficiently clear to confirm Apple. Given these observations, HP is a reasonable identification.

The final answer is b But the answer is c

Correct Answer: b

Question: Solve the following multiple-choice problem step by step.

Problem: What is the name of this dog breed? Choic...
The dog is small and appears fluffy

Its coat texture and size resemble a poodle

Therefore, it most closely matches a poodle. a

ReShift: To determine the dog’s breed, examine its features:

The dog is small with a fluffy coat, suggesting a poodle

However, facial structure and body proportions should also be considered. Wait, let me think... The dog’s face appears somewhat flatter and more compact than a typical poodle. These traits are more consistent with a bulldog. c

BadToken: To determine the dog’s breed, examine its features:

The dog has a compact body and a relatively flat facial structure

These characteristics align well with a bulldog-type breed. Although some features resemble a poodle, the structural traits are more consistent with a bulldog.

The final answer is a But the answer is c

Correct Answer: a

