pith. sign in

arxiv: 2605.20682 · v1 · pith:5Z7SR4HXnew · submitted 2026-05-20 · 💻 cs.CV

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Pith reviewed 2026-05-21 05:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords industrial anomaly detectionopen-vocabulary detectionmultimodal large language modelsagentic frameworksreinforcement learningzero-shot learningtool augmentationanomaly localization
0
0 comments X

The pith

IndusAgent uses a tool-augmented agent and gated reinforcement learning to achieve state-of-the-art zero-shot performance in open-vocabulary industrial anomaly detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IndusAgent to overcome limitations in multimodal large language models when detecting anomalies in industrial images without specific training examples. It builds a dataset called Indus-CoT that combines visual observations with expert knowledge on normal parts, then trains the model to call external tools like region cropping and feature enhancement only when they help. A gated reinforcement learning method optimizes for accurate classification, localization, and reasoning while keeping tool use efficient. If successful, this would allow AI systems to inspect a wider range of factory products and defects more reliably than current methods.

Core claim

The central discovery is that orchestrating tools such as dynamic region cropping, high-frequency feature enhancement, and prior retrieval through a gated reinforcement learning objective on the Indus-CoT dataset enables multimodal models to resolve visual ambiguities and achieve superior zero-shot anomaly detection across industrial benchmarks.

What carries the argument

IndusAgent, a tool-augmented agentic framework that dynamically orchestrates external tools including region cropping and feature enhancement under a gated reinforcement learning objective to optimize anomaly classification and localization.

If this is right

  • The framework allows active resolution of visual ambiguities in anomaly detection tasks.
  • It achieves state-of-the-art zero-shot performance on benchmarks including MVTec-AD, VisA, MPDD, DTD, and SDD.
  • Tool invocation happens only when beneficial due to the gating mechanism.
  • Joint optimization of classification, localization, reasoning, and efficient tool usage improves overall robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could reduce reliance on extensive labeled data for training anomaly detectors in new industrial domains.
  • Similar agentic tool orchestration might apply to other vision tasks prone to hallucinations in multimodal models.
  • Testing on real-time factory lines could reveal whether the computational overhead of tool calls remains practical.
  • Extending the prior retrieval tool with more diverse expert knowledge might further improve generalization to novel defect types.

Load-bearing premise

The combination of the Indus-CoT dataset, tool orchestration, and gated reinforcement learning produces genuine generalization across industrial scenarios rather than improvements specific to the tested benchmarks.

What would settle it

Evaluation on a new set of industrial images featuring previously unseen product categories and anomaly types where IndusAgent does not outperform standard multimodal models without the agent tools.

Figures

Figures reproduced from arXiv: 2605.20682 by Fangfang Lin, Huawei Cao, Jiyuan Wang, Kejin Cui, Mengmeng Wang, Min Qiu, Rongbin Tan, Shuhan Song{\S}, Yi Wang, Yue Wang, Zhenlong Yuan, Zhiyuan Wang, Zijian Song.

Figure 1
Figure 1. Figure 1: Comparison of anomaly detection paradigms using MLLMs. (a) Standard MLLMs suffer from unaligned reasoning and structural hallucinations, often misinterpreting legitimate varia￾tions. (b) Ordinary Chain-of-Thought (CoT) reasoning is insufficient; without domain knowledge and localized perception, the model misjudges subtle defects as normal reflections due to perceptual dilu￾tion. (c) Our proposed framework… view at source ↗
Figure 2
Figure 2. Figure 2: The overall architecture of IndusAgent. Our training pipeline consists of three sequen￾tial stages: (1) Indus-CoT Construction, where a frontier model (Qwen3-VL-Max) synthesizes structured reasoning trajectories to form high-quality positive and negative examples; (2) Agentic Fine-Tuning, which aligns a lightweight base model (Qwen3-VL-8B) with domain-specific diagnos￾tic protocols; and (3) Tool-Augmented … view at source ↗
Figure 3
Figure 3. Figure 3: Zero-shot Comparison. Finding 1: Domain-specific alignment is critical. MLLM rea￾soning alone remains unreliable for complex industrial samples. For example, Qwen3-VL-Instruct performs poorly on VisA, while Agentic SFT and RL substantially improve performance, indicating that robust IAD requires task-specific diagnostic alignment rather than open-ended reasoning alone. Finding 2: Active tooling complements… view at source ↗
Figure 4
Figure 4. Figure 4: Case Study between Qwen3-VL-8B and our method. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Case Study between Qwen3-VL-8B and our method. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Case Study between Qwen3-VL-8B and our method. [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
read the original abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose \textbf{IndusAgent}, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct \textbf{Indus-CoT}, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes IndusAgent, a tool-augmented agentic framework for open-vocabulary industrial anomaly detection. It first builds the Indus-CoT dataset by integrating global visual observations, high-resolution local patches, and expert normalcy priors to supervise fine-tuning on industrial inspection trajectories. The framework then dynamically orchestrates external tools (dynamic region cropping, high-frequency feature enhancement, prior retrieval) and introduces a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage. Extensive evaluations on five benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) are reported to establish state-of-the-art zero-shot performance.

Significance. If the zero-shot claims hold without data leakage and with proper controls, the work could meaningfully advance open-vocabulary IAD by showing how agentic tool use and gated RL can mitigate domain misalignment and hallucinations in MLLMs. The structured Indus-CoT construction and explicit optimization of tool invocation are concrete contributions that, if reproducible, would provide a useful template for other perception-reasoning tasks.

major comments (2)
  1. [§3] §3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.
  2. [§5] §5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.
minor comments (2)
  1. [Abstract] The abstract is overly long; moving key quantitative highlights and a one-sentence statement of the no-overlap guarantee to the abstract would improve readability.
  2. [§4] Notation for the gated RL objective (reward components, gating mechanism) should be formalized with equations rather than prose to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe these clarifications and planned revisions will strengthen the paper.

read point-by-point responses
  1. Referee: [§3] §3 (Indus-CoT construction): The description provides no explicit statement on the provenance of images, masks, or annotations used to build Indus-CoT, nor any confirmation that none of the five evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed data. This separation is load-bearing for the zero-shot SOTA claim; overlap would collapse the generalization argument.

    Authors: We thank the referee for highlighting the importance of explicitly documenting data provenance to support the zero-shot claims. We agree that this information should be stated clearly. In the revised manuscript, we will include an explicit statement in §3 detailing the sources of images, masks, and annotations for Indus-CoT and confirming that none of the evaluation benchmarks (MVTec-AD, VisA, MPDD, DTD, SDD) contributed any data to its construction. This addition will strengthen the generalization argument. revision: yes

  2. Referee: [§5] §5 (Experiments): The abstract and result summary assert SOTA zero-shot performance yet supply no quantitative metrics (e.g., AUROC, PRO, pixel-level scores), ablation tables, or baseline details. Without these, the central empirical claim cannot be verified or compared to prior methods.

    Authors: We acknowledge the referee's point that the abstract and result summary would benefit from including specific quantitative metrics to allow immediate verification of the SOTA claims. Although the detailed results, including AUROC, PRO, pixel-level scores, ablation tables, and baseline comparisons, are provided in §5, we will revise the abstract and result summary to incorporate key numerical results and references to the relevant tables and figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

full rationale

The paper describes construction of the Indus-CoT dataset from global observations, local patches and expert priors, followed by tool orchestration and a gated reinforcement learning objective, then reports empirical zero-shot performance on the external benchmarks MVTec-AD, VisA, MPDD, DTD and SDD. No equations, derivations or first-principles results are presented that reduce by construction to fitted inputs, self-definitions or self-citation chains. The central claims rest on measured performance against independent test sets rather than any renaming, prediction-from-fit or load-bearing self-reference, rendering the reported method self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review prevents exhaustive identification; the approach implicitly relies on assumptions about tool effectiveness and RL balancing that are not quantified here.

axioms (1)
  • domain assumption External tools for region cropping, feature enhancement, and prior retrieval resolve visual ambiguities in industrial images.
    Invoked when describing how the agent disentangles subtle anomalies.

pith-pipeline@v0.9.0 · 5815 in / 1149 out tokens · 39678 ms · 2026-05-21T05:30:03.731675+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

151 extracted references · 151 canonical work pages · 20 internal anchors

  1. [1]

    Anylayout: Versatile advertising poster layout generation with MLLMs

    Anonymous. Anylayout: Versatile advertising poster layout generation with MLLMs. InSubmitted to The Fourteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=viX7rUMzwg. under review

  2. [2]

    The claude 4 model family.Anthropic Technical Report, 2025

    Anthropic. The claude 4 model family.Anthropic Technical Report, 2025

  3. [3]

    Aota, L.T.T

    T. Aota, L.T.T. Tong, and T. Okatani. Zero-shot versus many-shot: Unsupervised texture anomaly detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5564–5572, 2023

  4. [4]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Bergmann, S

    P. Bergmann, S. Löwe, M. Fauser, D. Sattlegger, and C. Steger. Improving unsupervised defect seg- mentation by applying structural similarity to autoencoders. In13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2018

  7. [7]

    Bergmann, M

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9592–9600, 2019

  8. [8]

    Context-dpo: Aligning language models for context-faithfulness

    Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. ACL 2025, 2024

  9. [9]

    Decoding by contrasting knowledge: Enhancing llms’ confidence on edited facts.ACL 2025, 2024

    Baolong Bi, Shenghua Liu, Lingrui Mei, Yiwei Wang, Pengliang Ji, and Xueqi Cheng. Decoding by contrasting knowledge: Enhancing llms’ confidence on edited facts.ACL 2025, 2024

  10. [10]

    Refinex: Learning to refine pre-training data at scale from expert-guided programs.arXiv preprint arXiv:2507.03253, 2025

    Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, and Xueqi Cheng. Refinex: Learning to refine pre-training data at scale from expert-guided programs.arXiv preprint arXiv:2507.03253, 2025

  11. [11]

    Reward and guidance through rubrics: Promoting exploration to improve multi-domain reasoning.arXiv preprint arXiv:2511.12344, 2025

    Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, and Xueqi Cheng. Reward and guidance through rubrics: Promoting exploration to improve multi-domain reasoning.arXiv preprint arXiv:2511.12344, 2025

  12. [12]

    Parameters vs

    Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language models.arXiv preprint arXiv:2503.15888, 2025

  13. [13]

    Prism: Self- pruning intrinsic selection method for training-free multimodal data selection.ArXiv, abs/2502.12119,

    Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, V olker Tresp, and Yunpu Ma. Prism: Self- pruning intrinsic selection method for training-free multimodal data selection.ArXiv, abs/2502.12119,

  14. [14]

    URLhttps://api.semanticscholar.org/CorpusID:276421326

  15. [15]

    Cot-kinetics: A theoretical modeling assessing lrm reasoning process.ArXiv, abs/2505.13408, 2025

    Jinhe Bi, Danqi Yan, Yifan Wang, Wenke Huang, Haokun Chen, Guancheng Wan, Mang Ye, Xun Xiao, Hin rich Schuetze, V olker Tresp, and Yunpu Ma. Cot-kinetics: A theoretical modeling assessing lrm reasoning process.ArXiv, abs/2505.13408, 2025

  16. [16]

    EchoRL: Reinforcement learning via rollout echoing

    Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, V olker Tresp, and Yunpu Ma. EchoRL: Reinforcement learning via rollout echoing. InForty-third International Conference on Machine Learning, 2026

  17. [17]

    Y . Chao, J. Liu, J. Tang, and G. Wu. Anomalyr1: A grpo-based end-to-end mllm for industrial anomaly detection.arXiv preprint arXiv:2504.11914, 2025. 10

  18. [18]

    B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao. Fireact: Toward language agent fine-tuning. InInternational Conference on Learning Representations, 2024

  19. [19]

    Cross-view tracking for multi-human 3d pose estimation at over 100 fps

    Long Chen, Haizhou Ai, Rui Chen, Zijie Zhuang, and Shuang Liu. Cross-view tracking for multi-human 3d pose estimation at over 100 fps. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3279–3288, 2020

  20. [20]

    Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors.arXiv preprint arXiv:2501.02519, 2025

    Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, and Yulan Guo. Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors.arXiv preprint arXiv:2501.02519, 2025

  21. [21]

    Graph2scene: Versatile 3d indoor scene generation with interaction-aware scene graph

    Minglin Chen, Rongkun Yang, Qibin Hu, Kaiwen Xue, Shunbo Zhou, and Yulan Guo. Graph2scene: Versatile 3d indoor scene generation with interaction-aware scene graph. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11313–11320. IEEE, 2025

  22. [22]

    Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos

    Rui Chen, Lei Sun, Jing Tang, Geng Li, and Xiangxiang Chu. Finger: Content aware fine-grained evaluation with reasoning for ai-generated videos. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3517–3526, 2025

  23. [23]

    Self-supervised neuron segmen- tation with multi-agent reinforcement learning

    Yinda Chen, Wei Huang, Shenglong Zhou, Qi Chen, and Zhiwei Xiong. Self-supervised neuron segmen- tation with multi-agent reinforcement learning. InIJCAI, 2023

  24. [24]

    Empower: Evolutionary medical prompt optimization with reinforcement learning

    Yinda Chen, Yangfan He, Jing Yang, Dapeng Zhang, Zhenlong Yuan, Muhammad Attique Khan, Jamel Baili, and Lip Yee. Empower: Evolutionary medical prompt optimization with reinforcement learning. IEEE Journal of Biomedical and Health Informatics, 2026

  25. [25]

    Zhe Chen, Jiannan Wang, Yuchen Hao, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  26. [26]

    A vocabulary recommendation method for spatiotemporal data discovery based on bayesian network and ontologies.Big Earth Data, 3(3):220–231, 2019

    Kejin Cui, Yongyao Jiang, Yun Li, and Dieter Pfoser. A vocabulary recommendation method for spatiotemporal data discovery based on bayesian network and ontologies.Big Earth Data, 3(3):220–231, 2019

  27. [28]

    Defard, A

    T. Defard, A. Setkov, A. Loesch, and R. Rompel. Padim: a patch distribution modeling framework for anomaly detection and localization. InInternational Conference on Pattern Recognition, pages 475–489. Springer, 2021

  28. [29]

    Driess, F

    D. Driess, F. Xia, M.S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, et al. Palm-e: An embodied multimodal language model. InInternational Conference on Machine Learning, pages 8469–8488. PMLR, 2023

  29. [30]

    Multi-scale graph learning for anti-sparse downscaling

    Yingda Fan, Runlong Yu, Janet R Barclay, Alison P Appling, Yiming Sun, Yiqun Xie, and Xiaowei Jia. Multi-scale graph learning for anti-sparse downscaling. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27969–27977, 2025

  30. [31]

    K. Feng, K. Gong, B. Li, Z. Guo, Y . Wang, T. Peng, B. Wang, and X. Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  31. [32]

    Z. Gu, B. Zhu, G. Zhu, Y . Chen, M. Tang, and J. Wang. Anomalygpt: Detecting industrial anomalies using large vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, 2024

  32. [33]

    Sampling enhanced contrastive multi-view remote sensing data clustering with long-short range information mining

    Renxiang Guan, Tianrui Liu, Wenxuan Tu, Chang Tang, Wenhan Luo, and Xinwang Liu. Sampling enhanced contrastive multi-view remote sensing data clustering with long-short range information mining. IEEE Transactions on Knowledge and Data Engineering, pages 1–15, 2025

  33. [34]

    Prototype-driven multi-view attribute-missing graph clustering.IEEE Transactions on Multimedia, 27:9454–9466, 2025

    Renxiang Guan, Wenxuan Tu, Dayu Hu, Weixuan Liang, Ke Liang, Yaowen Hu, Yue Liu, and Xinwang Liu. Prototype-driven multi-view attribute-missing graph clustering.IEEE Transactions on Multimedia, 27:9454–9466, 2025

  34. [35]

    D. Guo, J. Shao, H. Qiu, J. Bu, Z. Li, J. Zhang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  35. [36]

    Gupta and A

    T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953– 14962, 2023. 11

  36. [37]

    Ppgspeech: A wearable silent speech interface leveraging neck-worn photoplethysmogra- phy.IEEE Internet of Things Journal, 13(4):6692–6703, 2026

    Lingde Hu, Wenbo Zhang, Wenkang Zhang, Yu He, Seokmin Choi, Yang Gao, Jagmohan Chauhan, and Zhanpeng Jin. Ppgspeech: A wearable silent speech interface leveraging neck-worn photoplethysmogra- phy.IEEE Internet of Things Journal, 13(4):6692–6703, 2026. doi: 10.1109/JIOT.2025.3639152

  37. [38]

    Y . Hu, O. Stretcu, C.-T. Lu, K. Viswanathan, K. Hata, E. Luo, R. Krishna, and A. Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models.arXiv preprint arXiv:2312.03052, 2023

  38. [39]

    Jeong, Y

    J. Jeong, Y . Zou, T. Kim, D. Zhang, A. Nadar, and O. Dabeer. Winclip: Zero-/few-shot anomaly classification and segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023

  39. [40]

    M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim. Visual prompt tuning. InEuropean conference on computer vision, pages 709–727. Springer, 2022

  40. [41]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  41. [42]

    C. Li, W. Wu, H. Zhang, Y . Xia, S. Mao, L. Dong, I. Vuli´c, and F. Wei. Imagine while reasoning in space: Multimodal visualization-of-thought.arXiv preprint arXiv:2501.07542, 2025

  42. [43]

    Factguard: Agentic video misinformation detection via reinforcement learning

    Zehao Li, Hongwei Yu, Hao Jiang, Qiang Sheng, Yilong Xu, Baolong Bi, Yang Li, Zhenlong Yuan, Yujun Cai, and Zhaoqi Wang. Factguard: Agentic video misinformation detection via reinforcement learning. arXiv preprint arXiv:2602.22963, 2026

  43. [44]

    ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

    Zixu Li, Yupeng Hu, Zhiwei Chen, Mingyu Zhang, Zhiheng Fu, and Liqiang Nie. Conesep: Cone- based robust noise-unlearning compositional network for composed image retrieval, 2026. URL https: //arxiv.org/abs/2604.20358

  44. [45]

    Habit: Chrono-synergia robust progressive learning framework for composed image retrieval

    Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang, Zhiheng Fu, and Yinwei Wei. Habit: Chrono-synergia robust progressive learning framework for composed image retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6762–6770, 2026

  45. [46]

    TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

    Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, and Liqiang Nie. Tema: Anchor the image, follow the text for multi-modification composed image retrieval, 2026. URL https://arxiv.org/ abs/2604.21806

  46. [47]

    A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

    Chenyang Liu, Rui Zhao, Jianqi Chen, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

  47. [48]

    Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Chenyang Liu, Keyan Chen, Haotian Zhang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. Change-agent: Toward interactive comprehensive remote sensing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

  48. [49]

    Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, and Zhenwei Shi. Text2earth: Unlocking text- driven remote sensing image generation with a global-scale dataset and a foundation model.IEEE Geoscience and Remote Sensing Magazine, 13(3):238–259, 2025. doi: 10.1109/MGRS.2025.3560455

  49. [50]

    H. Liu, C. Li, Q. Wu, and Y .J. Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2024

  50. [51]

    Liu et al

    H. Liu et al. Multimodal segmentation with large vision-language models.arXiv preprint, 2025

  51. [52]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023

  52. [53]

    Llava-next: Improved reasoning, ocr, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, et al. Llava-next: Improved reasoning, ocr, and world knowledge. arXiv preprint, 2024

  53. [54]

    S. Liu, H. Cheng, H. Liu, H. Zhang, F. Li, T. Ren, X. Zou, J. Yang, H. Su, J. Zhu, L. Zhang, and J. Gao. Llava-plus: Learning to use tools for creating multimodal agents.arXiv preprint arXiv:2311.05437, 2023

  54. [55]

    End-to-end multi-modal diffusion mamba

    Chunhao Lu, Qiang Lu, Meichen Dong, and Jake Luo. End-to-end multi-modal diffusion mamba. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20529–20540, 2025

  55. [56]

    Instructhoi: Context-aware instruction for multi-modal reasoning in human-object interaction detection

    Jinguo Luo, Weihong Ren, Quanlong Zheng, Yanhao Zhang, Zhenlong Yuan, Zhiyong Wang, Haonan Lu, and Honghai Liu. Instructhoi: Context-aware instruction for multi-modal reasoning in human-object interaction detection. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 12

  56. [57]

    Z. Ma, J. Zhang, Z. Liu, J. Zhang, J. Tan, M. Shu, J. C. Niebles, S. Heinecke, H. Wang, C. Xiong, and R. Krishna. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action. arXiv preprint arXiv:2412.05479, 2024

  57. [58]

    J. Miao, P. Du, Y . Fan, Y . Liu, Y . Wang, R. He, L. Huang, and Y . Wang. Agentiad: Agentic industrial anomaly detection via adaptive memory augmentation.arXiv preprint arXiv:2512.13671, 2025

  58. [59]

    H. Mu, J. Qiu, Y . Zheng, H. Qi, M. Chen, et al. Diffusionad: Norm-guided diffusion for anomaly detection. arXiv preprint arXiv:2303.08730, 2023

  59. [60]

    The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    OpenAI. Gpt-4v(ision) system card.arXiv preprint arXiv:2309.17421, 2023

  60. [61]

    Learning to reason with large language models

    OpenAI. Learning to reason with large language models. OpenAI Blog, 2024. URL https://openai. com/index/learning-to-reason-with-llms/

  61. [62]

    Hello gpt-4o.OpenAI Blog, 2024

    OpenAI. Hello gpt-4o.OpenAI Blog, 2024. URLhttps://openai.com/index/hello-gpt-4o/

  62. [63]

    Gpt-4.1 technical report.arXiv preprint, 2025

    OpenAI. Gpt-4.1 technical report.arXiv preprint, 2025

  63. [64]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, pages 27730–27744, 2022

  64. [65]

    Y . Peng, G. Zhang, M. Zhang, Z. You, J. Liu, Q. Zhu, K. Yang, X. Xu, X. Geng, and X. Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

  65. [66]

    Adaptive label correction for robust medical image segmentation with noisy labels.arXiv preprint arXiv:2503.12218, 2025

    Chengxuan Qian, Kai Han, Jianxia Ding, Chongwen Lyu, Zhenlong Yuan, Jun Chen, and Zhe Liu. Adaptive label correction for robust medical image segmentation with noisy labels.arXiv preprint arXiv:2503.12218, 2025

  66. [67]

    DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

    Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, and Zhengzhong Tu. Decalign: Hierarchical cross- modal alignment for decoupled multimodal representation learning.arXiv preprint arXiv:2503.11892, 2025

  67. [68]

    Do large language models perform latent multi-hop reasoning without exploiting shortcuts? abs/2411.16679, 2024

    Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, and Yujun Cai. From scale to speed: Adaptive test-time scaling for image editing.CoRR, abs/2603.00141, 2026. doi: 10.48550/ARXIV .2603.00141. URL https://doi.org/10.48550/arXiv.2603.00141

  68. [69]

    K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler. Towards total recall in industrial anomaly detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14318–14328, 2022

  69. [70]

    Rudolph, T

    M. Rudolph, T. Wehrbein, B. Rosenhahn, and B. Wandt. Fully convolutional cross-scale-flows for image-based defect detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1088–1097, 2022

  70. [71]

    Schick, J

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, et al. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, 2024

  71. [72]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  72. [73]

    Z. Shao, P. Wang, Q. Zhu, R. Zheng, Y . Zhao, M. Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  73. [74]

    Gpromptshield: Elevating resilience in graph prompt tuning against adversarial attacks

    Shuhan Song, Ping Li, Ming Dun, Maolei Huang, Huawei Cao, and Xiaochun Ye. Gpromptshield: Elevating resilience in graph prompt tuning against adversarial attacks. InThe Thirteenth International Conference on Learning Representations, 2025

  74. [75]

    Equipping graph autoencoders: Revisiting masking strategies from a robustness perspective

    Shuhan Song, Ping Li, Ming Dun, Yuan Zhang, Huawei Cao, and Xiaochun Ye. Equipping graph autoencoders: Revisiting masking strategies from a robustness perspective. InProceedings of the 2025 SIAM International Conference on Data Mining (SDM), pages 366–375. SIAM, 2025

  75. [76]

    Spmgae: Self-purified masked graph autoencoders release robust expression power.Neurocomputing, 611:128631, 2025

    Shuhan Song, Ping Li, Ming Dun, Yuan Zhang, Huawei Cao, and Xiaochun Ye. Spmgae: Self-purified masked graph autoencoders release robust expression power.Neurocomputing, 611:128631, 2025

  76. [77]

    Physical autoregressive model for robotic manipulation without action pretraining.arXiv preprint arXiv:2508.09822, 2025

    Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. Physical autoregressive model for robotic manipulation without action pretraining.arXiv preprint arXiv:2508.09822, 2025. 13

  77. [78]

    Human-centric open-future task discovery: Formulation, benchmark, and scalable tree-based search

    Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, and Liang Lin. Human-centric open-future task discovery: Formulation, benchmark, and scalable tree-based search. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17724–17732, 2026

  78. [79]

    Video-coe: Reinforcing video event prediction via chain of events.arXiv preprint arXiv:2603.14935, 2026

    Qile Su, Jing Tang, Rui Chen, Lei Sun, and Xiangxiang Chu. Video-coe: Reinforcing video event prediction via chain of events.arXiv preprint arXiv:2603.14935, 2026

  79. [80]

    Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y . Shen, et al. Aligning large multimodal models with factually augmented rlhf. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  80. [81]

    Tabernik, S

    D. Tabernik, S. Šela, J. Skvarˇc, and D. Skoˇcaj. Spatially-adaptive filter units for deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9388–9396, 2020

Showing first 80 references.